90% Cost Reduction in AI — But It Had Nothing to Do With the Model
In the world of Artificial Intelligence, the constant pursuit of efficiency often leads us down the path of model optimisation. We hear about new models, faster inference times, and the ever-present quest for cheaper, more powerful AI. But what if the biggest breakthroughs in AI cost and performance weren't about the models themselves, but about how we orchestrate them and the underlying architecture we build upon?
Recently, a remarkable story emerged from Notion, a company known for its powerful productivity and note-taking platform. They reported achieving a staggering 90% cost reduction and an 85% latency reduction, all while enabling over 30 concurrent agent tasks. The truly eye-opening detail? This monumental improvement had virtually nothing to do with switching to a cheaper or faster AI model. Instead, the secret sauce lay in their innovative approach to prompt caching and orchestration.
This revelation is a game-changer, especially for the UK's small and medium-sized enterprises (SMEs), freelancers, and growing teams who are increasingly looking to AI to streamline their operations. Many businesses are hesitant to adopt AI due to perceived complexity and high costs. This Notion example offers a crucial insight: the real optimisation layer for AI isn't necessarily the AI model itself, but the intelligent architecture and workflow design that surrounds it. At WAi Forward, we've been championing this very philosophy with our Object-Oriented AI system, RunWAi.
This blog post will delve into why this distinction is so critical, exploring the limitations of solely focusing on model optimisation and highlighting the transformative power of smart architectural design. We'll unpack how techniques like prompt caching and intelligent orchestration can unlock unprecedented efficiency, making powerful AI accessible and affordable for businesses of all sizes.
The Siren Song of Model Optimisation: A Misleading Melody
The AI landscape is a whirlwind of progress, with new models emerging at a breathtaking pace. From behemoths like GPT-4 to a growing array of open-source alternatives, the sheer choice can be overwhelming. It's natural for businesses, when considering AI adoption, to gravitate towards finding the "cheapest" or "fastest" model as the primary route to cost savings and performance gains. This is where the industry often gets it wrong.
Think of an AI model as a highly skilled but specialised artisan. You can hire a faster artisan, or one who charges less per hour. However, if that artisan is constantly being asked to perform the same, repetitive tasks that could be batched, pre-prepared, or handled by a simpler tool, their individual speed or cost becomes a secondary concern. The real bottleneck is often the workflow, the communication, and the surrounding processes.
For many SMEs, the current AI paradigm often feels like this: you have a brilliant AI model, and you're feeding it raw, unstructured requests, one by one. Each request requires the model to process from scratch, understand the context, generate a response, and then move on to the next. This is inherently inefficient. It's like asking a master chef to chop every single onion for every single dish, from scratch, every single time, instead of having a prep chef chop a large batch of onions at the beginning of the day.
The costs associated with this approach can escalate rapidly. API calls to powerful models, even if individually cheap, add up quickly when made millions of times. Latency also becomes a major issue: waiting for individual AI responses to complete before the next step in a workflow leads to frustratingly slow user experiences and inefficient operational processes. Businesses then face a difficult choice: either accept these high costs and slow speeds, or invest heavily in finding and fine-tuning ever more expensive, cutting-edge models, hoping for marginal improvements.
This focus on the model as the sole optimisation point creates a few key problems:
- Escalating Costs: As usage grows, the cost of API calls becomes a significant operational expense. Businesses might find themselves paying premium prices for every single interaction, without a clear path to reducing this base cost.
- Performance Bottlenecks: Latency isn't just about model inference speed. It's about the entire round trip: sending the prompt, the model processing, and the response returning. If the workflow requires multiple sequential AI calls, even a fast model can lead to a slow overall process.
- Vendor Lock-in and Complexity: Constantly chasing the latest model can lead to vendor lock-in and requires continuous effort in understanding new APIs, potential fine-tuning needs, and integration challenges. This adds significant technical overhead.
- Ignoring the "Last Mile" Problem: Even the most advanced AI model is useless if it's not integrated into a practical, efficient workflow. The "last mile" of getting AI to deliver tangible business value is often hindered by poor architecture, not a weak model.
- Misplaced Investment: Businesses might spend considerable time and resources researching and experimenting with different models, when the real gains could be unlocked by optimising the surrounding infrastructure and processes.
The Notion example powerfully illustrates that this focus on model optimisation is often a red herring. The true innovation lies in how we architect AI systems to be more intelligent, efficient, and cost-effective, irrespective of the specific model powering them.
The Power of Prompt Caching and Orchestration: Rethinking the AI Workflow
So, if not the model itself, what enabled Notion to achieve such dramatic improvements? The answer lies in two interconnected concepts: prompt caching and intelligent orchestration. These techniques shift the focus from individual AI calls to the broader system and workflow design.
Prompt Caching: Reusing Intelligence, Not Recomputing It
Imagine you're writing an email to a client. You might have a standard opening like, "Dear [Client Name], I hope this email finds you well." If you're sending this to 100 different clients, would you retype that exact phrase 100 times? Of course not. You'd use a template or a placeholder. Prompt caching applies this same principle to AI.
In essence, prompt caching involves storing and reusing work the model has already done. In its simplest form, the system stores the responses to previous queries: when an identical (or, with semantic caching, highly similar) prompt comes in again, it retrieves the cached response almost instantaneously, with no additional model call. Major providers also offer prompt caching at the token level, where a long, repeated prompt prefix (shared instructions or reference documents, for example) is processed once and reused across subsequent calls at a fraction of the normal input cost.
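To make the simplest form concrete, here is a minimal response-caching sketch in Python. The in-memory cache, key scheme, and `call_model` wrapper are illustrative assumptions for this post, not a description of Notion's or RunWAi's internals.

```python
import hashlib
import json

# Illustrative in-memory cache; a production system might use Redis or similar.
_cache: dict[str, str] = {}

def cache_key(prompt: str, model: str) -> str:
    """Derive a stable key from the exact prompt and model name."""
    payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def call_model(prompt: str, model: str) -> str:
    # Placeholder for a real LLM API call.
    return f"[{model}] response to: {prompt}"

def cached_completion(prompt: str, model: str = "example-model") -> str:
    """Serve identical requests from cache; call the model only on a miss."""
    key = cache_key(prompt, model)
    if key in _cache:
        return _cache[key]                 # cache hit: no API cost, near-instant
    response = call_model(prompt, model)   # cache miss: one paid model call
    _cache[key] = response                 # store for future identical requests
    return response
```

Exact-match caching like this only pays off when prompts repeat verbatim; semantic caching, which matches similar prompts via embeddings, extends the same idea to near-duplicates.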
This is incredibly powerful for several reasons:
- Eliminates Redundant Computations: Many AI tasks, especially in business workflows, involve repetitive requests. For example, generating a standard product description, summarising a common type of document, or drafting a follow-up email based on a template. Caching these responses dramatically reduces the number of calls to the AI model.
- Drastic Cost Reduction: Since cached responses don't incur API costs, the savings can be enormous, especially at scale. If 70% of your AI queries can be served from a cache, you've cut the model-call portion of your AI bill by roughly 70%, assuming broadly similar costs per query (see the back-of-envelope sketch after this list).
- Near-Instantaneous Responses: Retrieving data from a cache is orders of magnitude faster than sending a request to an AI model and waiting for its response. This leads to significant latency reduction, improving user experience and workflow speed.
- Enables Complex Workflows: By removing the latency and cost burden of repeated AI calls, caching makes it feasible to build more complex, multi-step AI-driven workflows that would otherwise be prohibitively expensive or slow.
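A quick back-of-envelope calculation shows how the cache hit rate drives the savings claimed above. The query volume and per-call price here are illustrative assumptions, not real pricing.

```python
monthly_queries = 1_000_000
cost_per_call = 0.002      # assumed blended cost per model call, in GBP
hit_rate = 0.70            # fraction of queries served from cache

baseline = monthly_queries * cost_per_call
with_cache = monthly_queries * (1 - hit_rate) * cost_per_call

print(f"Baseline:   £{baseline:,.0f}")    # £2,000
print(f"With cache: £{with_cache:,.0f}")  # £600, i.e. a 70% reduction
```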
For WAi Forward, prompt caching is a cornerstone of our RunWAi system. We recognise that many business processes involve predictable, repeatable interactions. By intelligently caching responses for common tasks—whether it's generating a draft social media post based on a template, summarising a standard meeting agenda, or creating a preliminary invoice from structured data—we ensure that our clients' AI usage is as efficient as possible. This means less waiting, lower bills, and more predictable outcomes.
Intelligent Orchestration: The Conductor of the AI Orchestra
Prompt caching is one piece of the puzzle. The other is intelligent orchestration – the art and science of managing, directing, and coordinating multiple AI agents and processes to achieve a larger goal. This is where the true architectural innovation lies, moving beyond single, isolated AI calls to sophisticated, interconnected workflows.
Orchestration involves:
- Workflow Design: Mapping out the sequence of steps, including human review points, AI tasks, and data transformations.
- Agent Management: Deciding which AI model or tool is best suited for a particular sub-task, and how they should interact.
- Conditional Logic: Implementing rules that determine the next step based on the output of previous steps.
- Data Flow Management: Ensuring that information is correctly passed between different AI agents and human users.
- Error Handling and Resilience: Building systems that can gracefully handle failures and recover.
In the context of Notion's success, their orchestration layer likely involved intelligently routing requests. Instead of every request going to a general-purpose, expensive LLM, their system might have first checked if a task could be handled by a simpler, more specialised tool, or if the answer was already cached. For tasks that *did* require a powerful LLM, the orchestrator would ensure that the prompt was optimally formatted, and that the response was then processed and stored for future use.
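Here is one way such a routing layer might look in Python. It is a sketch of the general pattern described above; the cache, the tool registry, and the model call are all stand-in assumptions, not Notion's actual implementation.

```python
# Simple in-memory cache; definitions repeated so the example stands alone.
_router_cache: dict[str, str] = {}

def call_model(prompt: str, model: str) -> str:
    # Placeholder for a real LLM API call.
    return f"[{model}] response to: {prompt}"

SIMPLE_TOOLS = {
    # Hypothetical cheap, deterministic handler for a known task type.
    "invoice_total": lambda task: f"£{sum(task['line_items']):.2f}",
}

def handle_request(task: dict) -> str:
    """Route a task to the cheapest capable handler: cache, simple tool, or LLM."""
    prompt = task["prompt"]

    # 1. Cheapest path: an identical request seen before is served from cache.
    if prompt in _router_cache:
        return _router_cache[prompt]

    # 2. Next cheapest: a specialised, deterministic tool handles known task types.
    tool = SIMPLE_TOOLS.get(task.get("type", ""))
    if tool is not None:
        return tool(task)

    # 3. Last resort: the expensive general-purpose model; cache the result so
    #    future identical requests never reach this step.
    response = call_model(prompt, "large-model")
    _router_cache[prompt] = response
    return response
```

The ordering is the whole point: each request is answered by the cheapest component that can handle it, and the expensive model is reserved for genuinely novel work.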
This is precisely how WAi Forward's RunWAi engine operates. We don't just offer generic AI chat tools. We build AI systems that treat work as structured objects—Leads, Tasks, Posts, Invoices. Each object has a clear lifecycle and predictable interactions. Our orchestration layer understands these objects and their lifecycles, allowing us to build highly efficient, automated workflows (sketched in code after this list):
- Object-Oriented AI: When a new Lead comes in (an object), our Lead the WAi platform might orchestrate a series of AI tasks: summarising the inquiry, drafting a personalised follow-up email, and scheduling a task for a sales representative—all within a defined workflow.
- Hybrid Workflows: AI drafts, suggests, and assists. Humans review, approve, and guide. Our orchestration ensures that the AI's output is presented to the human user in a clear, actionable format, and that their feedback is efficiently incorporated back into the system.
- Unified Ecosystem: PathWAI (productivity), Lead the WAi (marketing/sales), and PAI it Forward (finance) all leverage the RunWAi engine. This means that the orchestration logic and caching strategies are consistent across your business, creating a seamless, intelligent ecosystem. For example, a completed task in PathWAI could automatically trigger a follow-up email draft in Lead the WAi, which then might prompt the creation of an invoice object in PAI it Forward.
- Predictable Outcomes: Because our AI is object-oriented and orchestrated, the outcomes are far more predictable than with generic chat tools. We're not just asking an AI to "write something"; we're instructing it to perform a specific action within a defined business process.
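To illustrate the object-oriented pattern, here is a hypothetical sketch of a Lead object moving through an orchestrated pipeline with a human review gate. Every class, step name, and the `ai_step` helper are assumptions for illustration, not RunWAi's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class Lead:
    """A unit of work with a defined lifecycle, not a free-form chat."""
    inquiry: str
    status: str = "new"
    artifacts: dict = field(default_factory=dict)

def ai_step(prompt: str) -> str:
    # Placeholder for a cached, routed model call as sketched earlier.
    return f"[AI] {prompt[:60]}..."

def summarise(lead: Lead) -> None:
    lead.artifacts["summary"] = ai_step(f"Summarise this inquiry: {lead.inquiry}")

def draft_followup(lead: Lead) -> None:
    lead.artifacts["email_draft"] = ai_step(
        f"Draft a follow-up email based on: {lead.artifacts['summary']}"
    )

def human_review(lead: Lead) -> None:
    # Hybrid workflow: AI drafts, a person approves before anything is sent.
    lead.status = "awaiting_review"
    print(f"Draft ready for review:\n{lead.artifacts['email_draft']}")

PIPELINE = [summarise, draft_followup, human_review]

def run_pipeline(lead: Lead) -> Lead:
    """Run each step in order; the orchestrator, not the model, owns the flow."""
    for step in PIPELINE:
        step(lead)
    return lead

lead = run_pipeline(Lead(inquiry="We need help automating our invoicing."))
```

Because the pipeline, not the model, decides what happens next, the same Lead always moves through the same defined steps, which is what makes the outcomes predictable.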
The combination of prompt caching and intelligent orchestration creates a synergistic effect. Caching reduces the load on the AI models, making them more available for truly novel tasks. Orchestration ensures that AI is applied at the right time, in the right way, and that its outputs are leveraged effectively. This is how you achieve dramatic cost and latency reductions without necessarily changing the underlying AI model.
Beyond the Hype: Practical AI for Real Businesses
The implications of this architectural shift are profound, especially for UK SMEs, freelancers, and growing teams that WAi Forward serves. The "AI revolution" can feel distant and expensive, but the Notion example shows that the path to practical, affordable AI is through smart design, not just cutting-edge models.
Here’s how this architectural focus translates into tangible benefits for your business:
1. Making AI Accessible and Affordable
The 90% cost reduction achieved by Notion is not an isolated fluke; it's a demonstration of what's possible when you optimise the *system*, not just the individual components. For a small business, this means AI can move from being a "nice-to-have" luxury to a core operational tool. Instead of facing prohibitive monthly bills for every single AI interaction, well-designed caching and orchestration keep costs low and predictable, putting serious automation within reach of even the smallest teams.