I write about agentic AI so often that my Weiner (dog), Slob, has developed what I can only describe as a professional skepticism toward my keyboard. Every time I sit down to produce another piece on autonomous agents, orchestration layers, or the seventeen-layer governance stack no one is actually going to build, he looks at me with the specific variety of disappointment that only a Weiner (and a CFO) can generate. He is, as far as I can tell, more calibrated about this industry than most of the influencers covering it.
Anyways. . .
This particular piece was supposed to be quick.
A state-of-the-market survey, maybe two thousand words, something I could write between deployments. But that was six weeks ago. It took longer for the same reason everything in this field takes longer: the more carefully you look at the actual numbers, the more the proud pronouncements of the vendor ecosystem dissolve into something more ambiguous. The press has noticed this as well. You will have read lots of posts claiming that token costs are going through the roof, and that the open-source models are eating the proprietary ones alive while offering roughly the same capabilities. And somewhere out there is a CFO, at a company I will call Company X because I respect their legal department, who added up a single quarter of Claude Code token expenditure and arrived at something in the neighborhood of half a billion dollars. Yes, in a quarter, and the ROI, at last report, remains unclear. I want to be careful not to throw stones, because I live in a glass house with a very expensive electricity bill (sic, please set up a GoFundMe), but I will note that this is the kind of number that makes even enthusiasts pause.
I have been running agentic AI programs in production for roughly two years now, and the experience closely resembles a religious procession they hold in Echternach, a small town in Luxembourg that most people outside the BeNeLux have never heard of, where participants move forward two steps and then back one step. This way they cover ground slowly, in a kind of shuffle that looks like a profound ceremony or maybe even collective confusion. I am genuinely uncertain which category my work falls into on any given day of the week. Every deployment teaches me something that forces us to partially undo the deployment before it. Every governance framework you build turns out to require a meta-governance framework to manage the governance framework. And every promising use case that looked clean in the proof-of-concept phase produces a hairball of edge cases and organizational resistance that was invisible in the demo environment.
So yeah, I set out to write a field report on the state of agentic ROI in Q2 2026, which is fragile and concentrated, but real in some areas, and overall wildly overstated everywhere else. If you want to understand why both of those things are simultaneously true, you need to understand something about how agents actually fail in enterprise environments, because the failure modes are the same three problems dressed in different vendor logos.
Three ways to lose money professionally
This chapter has no intro. We’re just gonna jump in headfirst.
Let’s start with scope.
For readers who are coming to this fresh, agentic AI is the practice of deploying AI systems that autonomously execute multi-step processes across real systems, make decisions along the way, and hopefully escalate to humans only when something falls outside their operating parameters. A chatbot takes input and produces output. An agent takes a goal and produces outcomes, and interacts with the world in between. The distinction matters enormously for economics, because an agent that actually closes the loop on a business process is worth something different from a system that simply summarizes documents and waits to be told what to do next.
Everyone in AI now seems to have their own definition of an “agent”, and vendors are making this even worse. Ask ten of ’em and you’ll get eleven PowerPoint slides, three maturity models, and a chatbot wearing a fake moustache calling itself ‘autonomous’ (read it with a French accent). That’s why I created a deck in which I describe autonomy as a spectrum, including definitions (link below this chapter).
The problem I want to start with is that an agent ploughing through a long, complex, branched process accumulates token costs and compounds its errors as it goes, and occasionally it goes so thoroughly off-script that the only available recovery option is to restart the whole darn chain from scratch.
You have to know that a standard agentic system invokes between ten and twenty model calls per transaction, and each of those calls re-submits the full historical context that it accumulated up to that point. The cumulative input tokens processed over a multi-turn sequence grow quadratically (sic!), and not linearly, and that is because every step must process everything that came before it.
Yup.
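To make the quadratic growth concrete, here is a minimal sketch. The function name and the token counts (a 2,000-token starting context, 1,000 new tokens appended per step) are illustrative assumptions of mine, not measurements from any real deployment.

```python
def cumulative_input_tokens(steps: int, base_tokens: int, added_per_step: int) -> int:
    """Total input tokens billed when every call re-submits the full history."""
    total = 0
    context = base_tokens
    for _ in range(steps):
        total += context           # this call reads everything accumulated so far
        context += added_per_step  # its output or tool result is appended to the history
    return total


# Hypothetical workload: 2,000-token prompt, 1,000 tokens added per step.
ten_steps = cumulative_input_tokens(10, 2_000, 1_000)
twenty_steps = cumulative_input_tokens(20, 2_000, 1_000)
print(ten_steps, twenty_steps)
```

Under these assumptions, doubling the process from ten to twenty steps multiplies the billed input tokens by roughly 3.5 instead of 2, which is the quadratic term showing up on the invoice.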
Stanford’s Digital Economy Lab estimates that this redundant re-processing of historical context accounts for roughly 62% of total agent inference expenditure. That, my smart friend, is the majority of our bill, and it exists because the model hasn’t really got a memory, so it re-reads information it already read on the previous turn. In our programs, we learned that the single most reliable predictor of deployment failure is process length. So keep your agentic processes short, keep the decision surface bounded by a human, keep the number of external systems to a minimum, and agents work. Once you have a stable system, you can extend the process and add dependencies, take out a human-in-the-loop, or increase the ambiguity of the inputs and what have you. But if your ambition is too high and you start immediately with long-horizon, multi-step workflows, then within a short period you will watch your carefully constructed agent infrastructure produce a mess that even an employee irritated by three consecutive nights of overtime could not have managed.
Another problem is the cost of governance. I dedicated a few articles to the subject earlier, so if you want to read up on it, there’s even more. It is the most important failure factor for nearly a third of the 177 agentic deployments we’ve been following over the last year. The entire economic case for agentic automation is built on the idea that you replace expensive human labor with cheaper compute, but it only holds if the governance overhead of the agent program does not consume the savings.
Governance in this context is the combined cost of monitoring and fixing what agents are doing. This includes auditing their decisions, maintaining the compliance documentation that a regulated company requires before it will let your agents touch anything important, handling the incidents when agents do something unexpected, and last but not least, training the people whose job description is now “the person who watches the AI”. The cost of governance is consistently the primary bottleneck in enterprise agentic deployment. It is more limiting than model capability or data quality, and that is because organizations price it incorrectly in the business case. They budget for the model, the infrastructure and operations, but not for the cost of governance. The production audits that we’ve done suggest that raw model inference is only about 20% of the true total cost of ownership of an agentic system. The remaining 80% is consumed by the surrounding infrastructure: the observability pipelines, the governance guardrails, and the repeated execution loops that we could not have budgeted for. We first discovered this in month eighteen of our program, after three rounds of executive presentations about imminent productivity transformation, and at that time our program wasn’t in good shape.
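A back-of-the-envelope sketch of why this breaks business cases. The 20% inference share comes from the audits mentioned above; the function names and the monthly dollar figures are hypothetical and only there to show the arithmetic.

```python
def total_cost_of_ownership(inference_cost: float, inference_share: float = 0.20) -> float:
    """Scale an inference bill up to full TCO, given inference's share of TCO."""
    if not 0 < inference_share <= 1:
        raise ValueError("inference_share must be in (0, 1]")
    return inference_cost / inference_share


def net_savings(labor_savings: float, inference_cost: float,
                inference_share: float = 0.20) -> float:
    """Labor savings minus the full cost of running the agent program."""
    return labor_savings - total_cost_of_ownership(inference_cost, inference_share)


# Hypothetical monthly figures, for illustration only.
naive_view = 40_000 - 10_000  # business case that only counts the model bill
full_view = net_savings(labor_savings=40_000, inference_cost=10_000)
print(naive_view, full_view)
```

With these made-up numbers, the business case that only prices inference shows a healthy surplus, while the same program priced at full cost of ownership runs at a loss.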
Last big problem I want to touch on is scale mechanics.
If you have an agentic AI program that you run with a central team against an enterprise with thousands of processes, now that is an economic impossibility. The math just does not work. You cannot hire enough people with the right skills, and then move fast enough to stay relevant, and maintain knowledge about all those business domains simultaneously to produce deployments that actually work.
The only way to make the economics function at enterprise scale is to turn the organization into a participant in its own agentification. Yes, I think the solution to scaling agentic AI is what the industry calls citizen development. This is the practice of giving business users the tools, the training, and the pre-set governance guardrails to build their own agents within a framework that the central team maintains. Of course you need kindling to get the fire going. You, as a central team, need operational knowledge about agentic workflows, a control plane with observability and explainability, and governance and policies set up in and around your system first. But when that is set up and you’re running a successful pipeline, you can open the gates one at a time and allow the tech-savvy end-users to participate. That’s what I call a Zone based deployment strategy, whereby you allow end-users armed with, say, a Copilot Studio to develop their own local automations (Zone I), and when there’s nothing exploding you can ramp up and allow simple bounded workflows, starting and ending with a HITL (Zone II).
And yes, this is uncomfortable for the central team because it means giving up control, and it is uncomfortable for business users because it means acquiring new skills, and it is therefore one of those strategies that everyone agrees with in theory and nobody prioritizes in practice until the central program runs out of budget and is forced to find a different model.
*Read this here.
Why the factory model eventually works, and the word “eventually” is doing a lot of work here
The thing that makes me think the current fragility is temporary rather than fundamental, is the re-usable pattern and agent library that we’re building. Every deployment we complete produces reusable artifacts. Say, an orchestration pattern for a document extraction workflow can be adapted for a different document type in a fraction of the original development time, and that goes for governance wrappers or an integration layer to connect an agent to an ERP system. Those investments will be useful the next time we need to connect an agent to our ERP system, because the ERP system is almost certainly still there and still being difficult. A factory, in the sense I use that word, accumulates this kind of institutional capability over time, and there is a point in the maturity curve where the cost per deployment drops sharply because you are no longer building everything from scratch.
I think most agentic programs can approach this point, but you have to have something we call “de lange adem”, which is a Dutch phrase for “the long breath”, or the willingness to keep funding something that has not yet paid back. And that is a genuine prerequisite for this kind of program. Organizations that cannot sustain eighteen months of negative returns before the compounding effects of the central investment kick in should probably be more honest with themselves about whether they are ready to build an agentic AI factory at all.
I know this particular arc well. I have lived it. And I am writing this piece specifically because we are now, for the first time, seeing the organizations that committed early and built seriously begin to emerge from the red. That is the story underneath the story of agentic ROI in 2026. It isn’t something you’ll see vendors present in their decks, because they do not benefit from explaining how long the path actually is.
But it is the story the production data tells, my friend, if you read it carefully and strip out the survivorship bias.
What “agentic” actually means, and why most of it is not
Before we look at what is working, it is worth being precise about the definition problem, because it is responsible for a significant portion of the current market confusion. I mentioned it before in the introduction, but let’s go a little deeper here.
Gartner has estimated that of the thousands of technology vendors currently claiming agentic capabilities, somewhere around 130 of them actually provide systems that execute goal-driven, non-deterministic processes in the way the term is supposed to mean. The rest are simply chatbots calling themselves autonomous agents. This is what Gartner calls agent washing, and it is the dominant activity in the enterprise AI marketing ecosystem at the moment, and it is a pain in the ass if you ask me.
╰┈➤A system that takes a user prompt, runs it through a language model, and returns a response is not an agent.
╰┈➤A system that takes a high-level goal, decomposes it into executable steps, calls the relevant APIs, makes decisions at branch points, handles failures, and routes exceptions to humans when confidence drops below a threshold, now, that is an agent!
Well, here you have it. A chatbot that answers HR questions does not save you money unless you were paying someone to answer HR questions in the first place, and even then the savings are modest unless it is implemented at scale. But an agent that wears a hard hat, autonomously processes expense claims end-to-end, and escalates the edge cases to a human reviewer, now that changes headcount economics. The confusion between the two has led to a whole generation of AI investment decisions that are going to look embarrassing in about eighteen months.
The cases that actually work, and what they have in common
For this piece, I’d done some digging with my self organizing bunch of Oompa Loompas and I gave them the directive to find me proof of ROI on agentic AI or don’t come back at all.
And about a workday later they did, and they reported back that the companies that have deployed real agents against processes in production are generating numbers that are worth taking seriously. That is, with the caveat that those numbers are almost exclusively coming from a narrow category of use cases.
Take Suncoast Credit Union. They deployed a fraud detection workflow in which automated processes read transaction files from the Federal Reserve, applied detection logic, analyzed documents for anomalies, and then prioritized cases for review, and the result was a 75% reduction in fraud losses over two years and $3.3 million in prevented fraud year-to-date.
The thing here is that it is a combination of an agentic workflow that handles the process-steps with traditional machine learning for outlier detection. So, I guess the ROI is not so much because of improved operational efficiency, but more because you’re more effective at catching fraudulent transactions.
That is different with Halkbank, which operates in Turkey. They built a money transfer workflow that combined document scanning with optical character recognition, AI extraction of unstructured data, decision logic, and integration with core banking APIs. That took the processing time for a fully AI-handled order down from 54 seconds to 9 seconds, and it also reduced error rates by 50% and freed up 32 full-time employees to do something more interesting. Then take Johnson Controls, which automated 6,500 daily invoices across 2,000 properties, saved 900,000 manual engineering hours, and also managed to document $18 million in direct labor cost savings. Automation Anywhere deployed agents against their own support operations, where they achieved a 6.7x return on investment in under three years and resolved 35% of cases completely autonomously. I’m quite skeptical when it comes to tech providers showing ROI on the same stuff they’re selling. Still, these are real numbers from named organizations, and they are impressive numbers, and I do not want to dismiss them.
The pattern they share is worth understanding carefully, because every single one of these wins is sitting in what I call Zone II territory in our suitability framework, which is the zone occupied by processes that are high-frequency, low-ambiguity, rule-bounded, and measurable. Fraud detection has clear outcome metrics and pre-defined decision criteria. Money transfers follow regulated procedures with limited variance. Invoice processing is repetitive document work with stable schemas, and AA’s customer support triage operates against SLAs. These aren’t the glamorous problems that make good demo videos. They are boring and repetitive and well-defined, and therefore exactly the problems where agentic automation actually works, generates clean ROI, and does not produce governance nightmares at two in the morning.
I sometimes refer to our business as “working at the frontier of unsexy AI”, and that says it all.
It is a very complex world, but the end result is not a bigger, faster or smarter model like Fable, or a beautiful harness like Hermes, or even a beautiful World Model like Jepa. The problems that do look exciting in demos mostly fail in production, because they sit at the other end of the complexity spectrum: things like open-ended reasoning tasks with ill-defined success criteria, or multi-step workflows that touch systems of record across multiple vendors with inconsistent APIs.
The vendors that are trying to sell you a multi-step agentic workflow at scale are selling you something that exists in proof-of-concept form in a small number of very well-resourced companies but it does not exist in production form in most of the places where it is being promised.
The survey data the vendors didn’t include in their decks
In 2025, Capgemini did some research among 1500 execs in 14 countries called “Rise of Agentic AI”, and they found that only 15% of business processes are expected to reach even semi-autonomous operation levels in the next twelve months, and that trust in fully autonomous agents has fallen from 43% to 27% over the past year.
And McKinsey, in their “Seizing the agentic AI advantage” found that only 23% of organizations have successfully scaled an agentic system anywhere in the enterprise, and that within any individual business function the number of organizations with scaled production deployment never exceeds 10%.
And then there’s Gartner, who presented their research at the Gartner IT Symposium/Xpo, where they stated that more than 40% of agentic AI projects will be cancelled by the end of 2027 due to unclear business value, rising costs, or inadequate risk management. These are the statistics of a technology that is in the difficult, technically honest phase between interesting promise and operational maturity.
Here are a few findings from these reports:
╰┈➤Organizations that treated agentic AI as a technology adoption problem instead of a process redesign problem are stalling at 23%.
╰┈➤Organizations that forgot to budget for the governance infrastructure will make up a significant share of Gartner’s 40%.
╰┈➤Organizations that deployed agents against processes that were too long or ambiguous, or too deeply embedded in legacy system complexity are the ones that find themselves explaining to their steerco why the timeline needs to be extended again.
You will agree with me that none of this is surprising in retrospect.
And yes, all of it was predictable from the deployment data from the 177, but the problem is that the deployment data is not what gets shared at conferences. What gets presented at conferences is the Suncoast number, the Halkbank number and the Johnson Controls number, stripped of all context, of the multiple failed pilots that preceded the successful one, and, I’m sure, of the eighteen months of infrastructure investment that made the success possible.
Your AI budget disappeared before the first deployment
The classical SaaS business model operates at a gross margin between 75% and 85%. For every dollar of revenue, the infrastructure cost is somewhere between 15 and 25 cents.
This is your reference.
But agentic AI flips this. Model inference cost (the actual cost of running the language model through a reasoning loop) consistently runs at 55% to 60% of the total cost of an interaction. That structural shift in the cost base is an architectural property of systems that reason per transaction.
Yes, reasoning, tool calling, planning, coordination, memory retrieval, agent-to-agent communication, validation loops and governance all cost tokens as well.
Take Gorgias, an e-commerce customer service platform that documented what their CFO Kunal Agarwal called the four-dollar lesson. They discovered that an agent running open-ended reasoning loops without strict token budget constraints could consume more than four dollars in compute per customer interaction before resolving anything. If the interaction then fails to resolve, the vendor has spent four dollars to produce a failed customer experience. Uncontrolled recursive loops in open-source framework trials have resulted in single execution sequences accumulating thousands of dollars in an afternoon. The average cost of a stalled pilot (one that never makes it to production) reaches approximately $340,000 in direct engineering hours and model compute, without counting opportunity costs.
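A strict token budget constraint can be as simple as a guard that refuses to pay for the next step. The sketch below is a generic illustration, not Gorgias’s implementation; `TokenBudget`, `BudgetExceeded`, `run_agent`, the price and the step sizes are all hypothetical names and numbers.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when the next step would push spend over the per-interaction cap."""


@dataclass
class TokenBudget:
    max_usd: float            # hard cap per customer interaction
    usd_per_1k_tokens: float  # assumed flat price, for illustration
    spent_usd: float = 0.0

    def charge(self, tokens: int) -> None:
        cost = tokens / 1000 * self.usd_per_1k_tokens
        if self.spent_usd + cost > self.max_usd:
            raise BudgetExceeded(f"step needs ${cost:.2f}, cap is ${self.max_usd:.2f}")
        self.spent_usd += cost


def run_agent(step_token_counts: list[int], budget: TokenBudget) -> tuple[int, str]:
    """Run steps until done or until the budget refuses the next one."""
    completed = 0
    try:
        for tokens in step_token_counts:
            budget.charge(tokens)  # check BEFORE spending, not after
            completed += 1
    except BudgetExceeded:
        return completed, "escalated_to_human"
    return completed, "resolved"


budget = TokenBudget(max_usd=0.50, usd_per_1k_tokens=0.01)
result = run_agent([10_000, 20_000, 30_000], budget)
print(result, round(budget.spent_usd, 2))
```

The design choice that matters is the order: the guard checks the cost before the call is made, so a runaway loop ends in an escalation instead of an invoice.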
This is why we created the Semantic Density per Dollar model (paper in comments). Existing AI benchmarks measure what an agent can do, but they rarely measure what it costs to achieve the outcome. A model that solves a task using 10,000 tokens and a model that solves the same task using 1,000,000 tokens are often treated as equivalent successes, despite having radically different economic characteristics. For enterprises, token consumption is a direct operational expense, and every planning step, tool invocation, etcetera contributes to the final cost of delivering a business outcome.
We developed the SDpD model to quantify the relationship between outcome quality, task complexity, and computational expenditure. SDpD measures how much useful business value is produced per dollar of token spend and it combines task success, workflow complexity, and token cost into a single benchmark. This way we are able to compare agents on their capabilities and also on their economic efficiency. To complement SDpD, we also developed something that sounds really cool but is not rocket science. We call it the Dynamic Token Budget Negotiation Protocol that allows our agents to dynamically negotiate computational budgets during execution. So yes, instead of fixed token allocations, our resources are continuously redistributed based on task complexity and expected value! This reduces waste and also improves utilization, and helps agent systems achieve higher outcome quality at lower cost.
We are now using this model as a foundation for our Agentic FinOps, where the objective is to build agents that can reason economically at enterprise scale.
Having said that, the raw token price has dropped roughly 75% year over year. That is good news, but you have to understand that something else is happening at the same time, and it is called “the Jevons paradox”. That is a concept from nineteenth-century economics that describes what happens when a resource becomes cheaper. William Stanley Jevons observed in 1865 that making coal more efficient to burn increased coal consumption rather than decreasing it, because cheaper efficiency unlocks demand that previously did not exist. And now that same mechanism is running in agentic AI. When the cost per token falls, the volume of tokens consumed per task increases by multiple orders of magnitude, because cheaper compute licenses more ambitious workflows. Goldman Sachs projects that global token consumption will reach 120 quadrillion tokens per month by 2030, and that is a 24-fold increase from 2026.
The per-unit token price is falling, but the total bill is climbing. This is the Jevons paradox, enterprise edition.
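The arithmetic behind that sentence fits in two small functions. This is a sketch with hypothetical function names; the 75% price drop is the figure quoted above, and the tenfold volume growth is an assumed example, not a forecast.

```python
def bill_multiplier(price_drop: float, volume_multiplier: float) -> float:
    """Factor by which the total bill changes: new unit price times new volume."""
    return (1.0 - price_drop) * volume_multiplier


def breakeven_volume_multiplier(price_drop: float) -> float:
    """Volume growth at which the total bill stays exactly flat."""
    return 1.0 / (1.0 - price_drop)


flat_at = breakeven_volume_multiplier(0.75)  # growth that cancels a 75% price drop
example = bill_multiplier(0.75, 10.0)        # assumed: volume grows tenfold
print(flat_at, example)
```

So a 75% price drop only lowers your bill if token volume grows less than fourfold over the same period. In the assumed tenfold case the bill is 2.5 times higher despite the cheaper tokens.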
Our observation is that when we apply the SDpD with negotiated token allocation, we reduce cost by roughly 40% compared with static tokenomics approaches, but that 40% reduction is not free. Building the budgeting infrastructure requires engineering time and architectural discipline that most programs are not applying, because they are all too busy chasing the next deployment milestone.
That is why I think working at the unsexy frontier requires support from a research institute and that’s why I advise you when starting an agentification program to work together with a university or a specialized AI research institute (wink – wink).
By the way, context is not a landfill
This theme has its own chapter because it is the most common and most expensive mistake I see in production agent systems.
Every time an agent takes a step, it receives as input everything that happened before that step including system instructions and the full history of the conversation. This is what’s called the context window and it accumulates like a filing cabinet. The model on step 28 is processing roughly 28 times more input than the model on turn 1, and it is being charged accordingly. A three-retry pattern on a single failing API call triples the input token cost for that processing block, because each retry re-submits the entire accumulated history. And across thousands of concurrent agent sessions, this retry-driven spend multiplier inflates your budgets without producing any progress.
The research on what context accumulation does to model quality is, if anything, more alarming than what it does to the cost.
Chroma’s 2025 context rot research evaluated state-of-the-art models including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3 across variable context lengths (with constant task complexity). They saw that model performance degraded systematically when the input length scaled up. The insertion of topically related but irrelevant content within the context window significantly degrades the agent’s ability to locate critical data, and this vulnerability increases as the context fills. And when you’re shuffling the logical order of documents in the context window, well, that also degrades the agent’s chronological awareness and semantic inference. Now, this context rot manifests as a sharp drop in execution accuracy after 20 to 30 turns and output quality is degrading by 30% or more, which then prompts further correction attempts leading to cascading errors, and unproductive token consumption!
So people, a million-token context window is not a free pass to unlimited memory being used as a trash-can. It is a capability that degrades under load and costs a lot of cash when it degrades.
The economically rational response to this is structured memory management.
In this practice you’re replacing long-context accumulation with a persistent memory system that extracts and retrieves only the information actually relevant to that current step.
Take for instance Mem0 (an open source production memory framework). They reported more than 90% token cost reduction and 91% lower 95th-percentile latency compared to full-context approaches. Ok, I’m taking it with a grain of salt because they’re clearly marking their own homework.
A more impartial breakeven analysis, “Memory systems versus long-context inference”, compared fact-memory systems against long-context inference and found that a memory system becomes cheaper after approximately ten interactions at a context of 100,000 tokens, because the memory system pays a one-time write cost and then retrieves only the compact facts needed per turn, while the long-context approach re-processes the entire history on every turn indefinitely.
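The shape of that breakeven is easy to sketch if you use input tokens as a stand-in for cost. Everything in this example is an assumption for illustration: the function name, the 500,000-token one-time write cost and the 2,000 retrieved tokens per turn are mine and are not taken from the cited analysis, so the resulting turn count is not meant to reproduce its finding.

```python
from typing import Optional


def breakeven_turn(context_tokens: int, retrieved_tokens: int,
                   write_tokens: int, max_turns: int = 1_000) -> Optional[int]:
    """First turn at which the memory system has processed fewer tokens in total."""
    for turn in range(1, max_turns + 1):
        long_context_total = turn * context_tokens              # full history, every turn
        memory_total = write_tokens + turn * retrieved_tokens   # one-time write + compact reads
        if memory_total < long_context_total:
            return turn
    return None  # never breaks even within max_turns


# Hypothetical workload: 100,000-token context versus 2,000 retrieved tokens per turn.
turn = breakeven_turn(context_tokens=100_000, retrieved_tokens=2_000, write_tokens=500_000)
print(turn)
```

The one-time write cost is a fixed hurdle, and the per-turn gap between full context and compact retrieval decides how quickly you clear it. Short-lived sessions may never get there, which is why this is a breakeven question and not a universal rule.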
The industry has largely ignored this, because the path of least resistance when building an agent is to pass everything into the context and let the model sort it out. Now that tokenomics is getting more attention, this category of products will get more attention too.
*Read this post about the memory system I built for my personal Hermes workflows
Get your tokens at the thrift store
Every major model provider now offers substantial discounts for cached / second-hand tokens.
You get a cached token when the model has already processed and stored a previous token, and when it appears again in another request, your provider charges you a fraction of the normal price instead of computing it from scratch.
The thing about agentic workflows that makes caching so disproportionately valuable is that you are running a prompt a hundred thousand times, against every invoice that comes in, or whatever your process is, and the static prefix, which is the collection of your instructions and whatnot, is the same on call number 847 as it was on call number one.
OpenAI charges roughly ten times less for cached input tokens than for standard input tokens across multiple models. Anthropic applies the same approximate ratio for cache hits. Google Cloud offers 90% discounts on implicit cache hits. Fireworks prices cached input at 50% of standard input rates and applies the same discount to batch inference. Groq prices prompt-cached input for open-source models at, say, 50% of uncached rates.
Every major provider has independently converged on roughly the same pricing structure, and this means every major provider has independently concluded that helping you cache your tokens is good for their business, because it keeps you in their ecosystem. If your agents repeatedly push the same system instructions without caching them, as agentic workflows tend to do, you are paying a repetition tax that the provider is happy to collect.
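To see what that repetition tax looks like, you can blend the cached and uncached rates by cache hit rate. A minimal sketch: the function name, the $3.00 per million input tokens and the 90% hit rate are hypothetical, while the two discount ratios (0.1 and 0.5) mirror the approximate pricing described above.

```python
def blended_input_price(price_per_mtok: float, cache_hit_rate: float,
                        cached_price_ratio: float) -> float:
    """Average price per million input tokens at a given cache hit rate."""
    uncached = (1.0 - cache_hit_rate) * price_per_mtok
    cached = cache_hit_rate * price_per_mtok * cached_price_ratio
    return uncached + cached


# Hypothetical list price of $3.00 per million input tokens, 90% of input served from cache.
tenth_price = blended_input_price(3.00, cache_hit_rate=0.9, cached_price_ratio=0.1)
half_price = blended_input_price(3.00, cache_hit_rate=0.9, cached_price_ratio=0.5)
print(round(tenth_price, 2), round(half_price, 2))
```

Under these assumptions the same workload costs about $0.57 per million input tokens with a tenfold cache discount and about $1.65 with a 50% discount, against $3.00 with no caching at all.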
Just this little bit of knowledge warrants a whole blog on its own.
How I organize this in my Hermes setup is by putting all static content (system instructions, tool schemas, policy documents) at the very beginning of every prompt and never changing it between calls, so the provider recognizes the identical prefix and serves it from cache.
Buuuut, the research on prompt caching strategy is more nuanced than simply turning it on.
A 2026 study by Elias Lumer et al., who ran over 500 agent sessions on a deep research benchmark, found that prompt caching consistently reduces API costs by 41% to 80% and improves time-to-first-token by 13% to 31% across major providers, but they noticed that the specific caching strategy matters considerably.
Naive full-context caching, which simply allows the provider’s mechanism to operate on everything, can paradoxically increase latency, because volatile tool outputs trigger constant cache writes for content that will never be reused across sessions, and that incurs write penalties without yielding read discounts. But caching only the static system prompt, and allowing dynamic content to be recomputed on each turn, produces more consistent cost and latency benefits.
If you didn’t understand any of this, that’s ok, I had to read the research summary a couple of times as well. The only thing you have to remember is this design rule:
Put everything static at the top of the prompt, do not change it between calls, and push all dynamic content to the end.
It sounds obvious. Most production systems do not do it because prompt templates are written quickly under deadline and nobody audits them afterward.
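Here is what the rule looks like as a prompt template, plus a cheap audit you can run in a unit test. This is a sketch under assumptions: the constant names, the placeholder texts and the helper functions are hypothetical, and it only checks that the prefix is byte-identical between calls, not how any particular provider keys its cache.

```python
import os

# Static content: written once, never edited between calls (placeholder texts).
SYSTEM_INSTRUCTIONS = "You are an invoice-processing agent."
TOOL_SCHEMAS = '{"tools": ["lookup_vendor", "post_to_erp"]}'
POLICY_TEXT = "Escalate any invoice above the approval limit to a human."

STATIC_PREFIX = "\n\n".join([SYSTEM_INSTRUCTIONS, TOOL_SCHEMAS, POLICY_TEXT])


def build_prompt(dynamic_parts: list[str]) -> str:
    """Static prefix first, everything that changes per call last."""
    return STATIC_PREFIX + "\n\n" + "\n\n".join(dynamic_parts)


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the identical leading span of two prompts (character-wise)."""
    return len(os.path.commonprefix([a, b]))


call_1 = build_prompt(["Invoice 1: ACME, 1200 EUR", "Tool output: vendor found"])
call_847 = build_prompt(["Invoice 847: Globex, 90 EUR", "Tool output: vendor missing"])
print(shared_prefix_len(call_1, call_847) >= len(STATIC_PREFIX))
```

The usual way this breaks is somebody putting a timestamp, a request ID or the user’s name into the system instructions. One changed character near the top and the shared prefix collapses to almost nothing, so an audit like `shared_prefix_len` is worth having in your test suite.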
More agents is not a free lunch
The multi-agent architecture is the subject of more vendor enthusiasm and more poorly examined deployments than any other architectural pattern in the agentic space. I feel qualified to say this because I have been on the receiving end of several pitches for multi-agent systems that turned out to be just a single-agent orchestration layer with a marketing deck describing it as a society of minds.
The actual research on when multi-agent architectures help and when they hurt is considerably more careful than vendor marketing.
Guys at Google DeepMind evaluated 180 multi-agent configurations in a controlled study, and they found that multi-agent coordination produces substantial performance improvements on highly parallelizable, decomposable tasks. They measured an 81% improvement in success rate for tasks like financial report compilation that can be divided into genuinely independent subtasks. On strictly sequential planning tasks, the same multi-agent architecture degrades performance by 39% to 70% compared to a single well-resourced agent.
This is what I call “the sequential penalty”, and it occurs because the computational budget of a multi-agent system is fragmented into coordination overhead and message formatting, which starves the actual task execution. A single agent that is solving a sequential problem averages 7.2 active reasoning turns per task, but a multi-agent team solving the same problem spends its token budget passing intermediate results back and forth and runs out of budget before it runs out of problem.
Yup.
And get this. Now take error propagation dynamics. Multi-agent systems that operate in parallel with no central controller amplify downstream logical errors by 17.2 times, as mistakes cascade unchecked across independent nodes. Amplify. And this leads to the combinatorial explosion I wrote about here.
Systems that are using a central hub-and-spoke orchestrator limit this error amplification to 4.4 times by verifying intermediate outputs before they propagate. The architectural choice between centralized and decentralized multi-agent coordination determines whether your errors compound exponentially or merely multiply by a relatively manageable factor.
Then take the communication overhead.
In a fully decentralized multi-agent topology where every agent can communicate with every other agent, the number of potential communication channels scales quadratically.
Ten agents require 45 unique communication channels, each of which adds message processing latency and token consumption.
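The 45 comes from the standard pairwise formula n(n-1)/2. A small sketch makes the scaling visible; the hub-and-spoke comparison (one link per non-orchestrator agent) is my own addition for contrast and assumes the orchestrator is counted as one of the agents.

```python
def full_mesh_channels(agents: int) -> int:
    """Unique pairwise channels when every agent can talk to every other agent."""
    return agents * (agents - 1) // 2


def hub_and_spoke_channels(agents: int) -> int:
    """Channels when one agent is the orchestrator and the rest only talk to it."""
    return max(agents - 1, 0)


for n in (3, 5, 10, 20):
    print(n, full_mesh_channels(n), hub_and_spoke_channels(n))
```

Doubling the team from 10 to 20 agents roughly quadruples the channels in the full mesh, while the hub-and-spoke count only doubles.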
Your design default should be a single well-equipped workflow agent with good tools and proper memory architecture for the vast majority of enterprise automation tasks, escalating to a manager-worker pattern only when the task is demonstrably parallelizable. And you should deploy a true multi-agent architecture only when there is empirical evidence that the task structure warrants it. The alternative, which is defaulting to multi-agent because it sounds more sophisticated, well, that is simply a reliable way to pay more for worse results.
The security section you don’t read until the shit hits the fan
I deliberately keep this piece short, because I know from experience that really no one reads cybersecurity articles. The blogs I’ve written about this topic sit at the bottom of the viewer ranking.
The AGAT Software 2026 Security Survey recorded that 82% of executives express confidence that their existing security policies protect against unauthorized agent behaviors. But then, only 14.4% of deployed agents entered production with full security or IT department sign-off. Eighty-eight percent of companies reported confirmed or suspected security incidents linked to AI agent actions over the past year. And in healthcare, that number climbs to 92.7%.
Approximately 45.6% of technical teams are using shared API keys for agent-to-agent interactions. This of course makes individual tracing impossible when something goes wrong, and something always goes wrong in the end. A quarter of deployed agents have the capability to autonomously spawn and credential new sub-agents and that creates execution paths that weren’t audited. But only 21.9% of organizations treat agents as unique, identity-bearing service entities with distinct access scopes and audit trails.
Yeah, you need to treat agents as if they’re humans, but with even less trust. The organizations that are surprised by this in 2027 will have been warned in 2026.
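What "unique, identity-bearing service entity" means in practice can be sketched in a few lines: every agent gets its own identity and scopes, and every authorization decision is written to an audit trail under that identity. The class names, scope strings, and agent ID below are hypothetical and not taken from any specific IAM product; a real setup would sit on top of your existing identity provider.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AgentIdentity:
    agent_id: str              # one identity per agent, never a shared key
    scopes: frozenset[str]     # explicit allow-list of actions


@dataclass
class AuditLog:
    entries: list[dict] = field(default_factory=list)

    def record(self, agent_id: str, action: str, allowed: bool) -> None:
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "agent_id": agent_id,
            "action": action,
            "allowed": allowed,
        })


def authorize(identity: AgentIdentity, action: str, log: AuditLog) -> bool:
    """Deny by default and log every decision, allowed or not."""
    allowed = action in identity.scopes
    log.record(identity.agent_id, action, allowed)
    return allowed


log = AuditLog()
billing_agent = AgentIdentity("billing-agent-01", frozenset({"invoices:read"}))
print(authorize(billing_agent, "invoices:read", log))   # inside scope
print(authorize(billing_agent, "payments:write", log))  # outside scope
```

With a shared API key, both of those calls would look identical in the logs, which is exactly the tracing problem described above.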
How to actually reduce the bill
There’s a lot of knowledge floating around on cost optimization for agentic systems, but it is usually technically dense, and almost entirely unread by the people who are signing off on the token invoices. But because the Oompa Loompas did such a great job, I have spent a meaningful amount of time reading it so that you do not have to. The result is a practical hierarchy.
The highest-impact, lowest-complexity intervention available to most programs right now is prefix and context caching. The reason it is not universally applied is not that it is difficult to understand or implement, but that it requires discipline about prompt structure that most teams did not apply when they first wrote their prompts. The principle is that any content that remains static across multiple calls should sit at the beginning of the prompt and should not change between calls. Content that changes on every call should go at the end. This ordering allows your provider to cache the expensive static prefix and charge the discounted cache rate on subsequent calls. The “Don’t Break the Cache” paper, in which the brainiacs studied over 500 agent sessions, found a 41% to 80% cost reduction from this structural adjustment alone!
So yes, every large agentic production team should have a focus on tokenomics.
Next to that, you have to take care of your memory architecture. You have to treat context as a budget, as I said in the previous chapter, and a well-designed agent should actively manage that budget and not allow it to grow without bound. Agents that need to maintain awareness across many turns need a persistent memory system that stores extracted facts and retrieves only what is needed per step. This dramatically outperforms full-context accumulation once you reach roughly ten turns of interaction. Mem0’s token-efficient memory algorithm maintains competitive accuracy with under roughly 7,000 tokens per retrieval step (compared to 25,000 tokens or more for full-context approaches). The write cost is paid once, the read cost is paid cheaply per turn, and the context window stays small enough for the model to actually use it effectively.
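A toy version of "store extracted facts, retrieve only what is needed per step" looks like the sketch below. To be clear, this is not Mem0's algorithm: the keyword-overlap scoring and the one-token-per-word estimate are crude stand-ins I chose so the example runs without dependencies. A real system would use embeddings and a proper tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude assumption: about one token per whitespace-separated word."""
    return len(text.split())


class FactMemory:
    def __init__(self) -> None:
        self.facts: list[str] = []

    def write(self, fact: str) -> None:
        # The write cost is paid once, when the fact is extracted.
        if fact not in self.facts:
            self.facts.append(fact)

    def retrieve(self, query: str, token_budget: int) -> list[str]:
        """Return the most relevant facts that fit inside the token budget."""
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(f.lower().split())), f) for f in self.facts]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        selected: list[str] = []
        used = 0
        for _, fact in ranked:
            cost = estimate_tokens(fact)
            if used + cost > token_budget:
                continue  # skip facts that would blow the budget
            selected.append(fact)
            used += cost
        return selected


memory = FactMemory()
memory.write("customer 4411 prefers email contact")
memory.write("customer 4411 lives in Utrecht")
memory.write("invoice 9 was paid in March")

print(memory.retrieve("how does customer 4411 want contact", token_budget=6))
```

The point of the budget parameter is that the context handed to the model stays bounded no matter how many facts pile up in storage.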
Also take model routing into consideration. Not every step in an agentic workflow requires a frontier model. Period. Document parsing steps, simple extraction steps, etc. can be handled by smaller and cheaper models. RouteLLM (model router) boasts more than 2x cost reduction without quality loss on real workloads. AFlow (a workflow optimization framework) found smaller models outperforming GPT-4o on specific tasks at 4.55% of the cost.
The design principle that flows from this is that frontier models should be the escalation path, not the default. But most production systems led by a manager suffering from Dunning-Kruger have this exactly backwards, because the path of least resistance when building an agent is to point everything at the best available model and not think about it further.
Another layer you can optimize is what I would call architectural sleeping.
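"Escalation path, not the default" can be sketched as a simple cascade: try the cheapest tier first and only move up when the answer is not confident enough. This is a hand-rolled illustration, not how RouteLLM works internally (that uses a learned router that picks a model before the call). The tier names, the relative cost units, and the `fake_run_model` stub with its confidence scores are all made up for the example.

```python
from typing import Callable

# Hypothetical tiers, cheapest first; costs are relative units, not real prices.
MODEL_TIERS = [("small-local", 1), ("mid-tier", 10), ("frontier", 100)]


def route(
    task: str,
    run_model: Callable[[str, str], tuple[str, float]],
    min_confidence: float = 0.8,
) -> tuple[str, str, int]:
    """Return (model used, answer, total cost spent), escalating only when needed."""
    spent = 0
    for index, (name, cost) in enumerate(MODEL_TIERS):
        answer, confidence = run_model(name, task)
        spent += cost
        is_last_tier = index == len(MODEL_TIERS) - 1
        if confidence >= min_confidence or is_last_tier:
            return name, answer, spent
    raise RuntimeError("MODEL_TIERS must not be empty")


def fake_run_model(model: str, task: str) -> tuple[str, float]:
    """Stand-in for a real model call; pretends 'reconcile' tasks are hard."""
    hard = "reconcile" in task
    confidence = {
        "small-local": 0.3 if hard else 0.9,
        "mid-tier": 0.6 if hard else 0.95,
        "frontier": 0.9,
    }[model]
    return f"{model} answer", confidence


print(route("extract invoice date", fake_run_model))
print(route("reconcile ledgers across entities", fake_run_model))
```

Note the trade-off: a cascade pays for the failed cheap attempts on hard tasks, so it only wins when most of your steps are easy. That is why you need calibration data before you trust it.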
When you examine the actual runtime behavior of a large fraction of agents, you notice they spend the vast majority of their operational lifetime waiting for a human approval, an external API response, or a scheduled trigger. But have you ever thought about what happens when an agent waits? I mean waiting for an event that may or may not occur in the next hour. In the conventional setup you keep those agents on always-on virtual machines or containers, so you pay for compute during all the waiting. The durable actor model, implemented in platforms like Cloudflare’s Durable Objects and Temporal’s durable execution engine, inverts this by completely hibernating agents during dormancy and waking them only on event arrival.
A deployment of 10,000 agents where only 1% are active at any moment requires 10,000 always-on instances under a VM model and only around 100 active instances under the durable actor model. That is a 99% reduction in idle compute spend. The implementation complexity is real and should not be understated, but for programs at meaningful scale this architectural choice eventually becomes non-negotiable.
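The arithmetic behind that claim is simple enough to put in a function so you can plug in your own fleet size. This is a back-of-the-envelope sketch that assumes one instance per agent and ignores wake-up latency, state storage, and per-request pricing, all of which a real comparison would need to include.

```python
def active_instances(total_agents: int, active_fraction: float) -> int:
    """Instances that actually need compute at any given moment."""
    return round(total_agents * active_fraction)


def idle_compute_reduction(total_agents: int, active_fraction: float) -> float:
    """Fraction of always-on instances you no longer pay for when idle agents hibernate."""
    always_on = total_agents
    durable = active_instances(total_agents, active_fraction)
    return 1 - durable / always_on


print(active_instances(10_000, 0.01))
print(idle_compute_reduction(10_000, 0.01))
```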
Context compression is another technique that is really powerful but it requires more engineering smarts than some teams have. The LLMLingua prompt compression framework showed that it is capable of up to 20x compression with minimal quality degradation. LongLLMLingua then showed 4x token reduction and up to 94% cost reduction on long-context tasks, with 1.4x to 2.6x latency improvements.
Agents call the same external tools thousands of times a day. Without caching, you pay the full API cost every single time. With multi-level hierarchical caching, you pay once and serve the result from memory, and this way you can cut tool execution costs by 73.3% and query processing time by a factor of 13.3. And when you add dynamic tool retrieval to the mix, you save even more. Most agents receive their full tool catalog on every step, so the agent that is looking up, say, a customer record will also receive the complete schema for the payment processor, the shipping calculator, and fourteen other tools it will not touch on that step. And that all costs tokens to process. But when you fetch only the tools that are relevant to the current step and inject only those into the context, you can get up to 95% fewer context tokens per step, 32% better tool selection accuracy, and 70% lower end-to-end episode costs.
This little bit of knowledge also warrants a whole article on its own.
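Until that article exists, here is a minimal sketch of the dynamic tool retrieval idea: score each tool against the current step and inject only the top matches. The catalog, the descriptions, and the keyword-overlap scoring are illustrative assumptions; production systems typically use embedding similarity over the tool descriptions instead.

```python
# Hypothetical tool catalog: name -> short description.
TOOL_CATALOG = {
    "lookup_customer": "find a customer record by id or email",
    "charge_payment": "charge a card through the payment processor",
    "calculate_shipping": "calculate shipping cost for a parcel",
    "send_email": "send an email message to a customer",
}


def select_tools(step_description: str, catalog: dict[str, str], top_k: int = 2) -> list[str]:
    """Return the names of the top_k tools whose descriptions overlap most with the step."""
    step_words = set(step_description.lower().split())
    scores = {
        name: len(step_words & set(description.lower().split()))
        for name, description in catalog.items()
    }
    relevant = [name for name in scores if scores[name] > 0]
    ranked = sorted(relevant, key=lambda name: scores[name], reverse=True)
    return ranked[:top_k]


# Only the schema for the selected tool would be injected into the prompt.
print(select_tools("look up the customer record for id 4411", TOOL_CATALOG, top_k=1))
```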
But remember, the ordering here matters. You have to fix caching first, because it requires no new infrastructure and the provider already wants to give you the discount. Then fix memory architecture second, because it addresses the quadratic scaling problem that makes long workflows expensive. After that comes model routing, because it requires calibration data but the savings are really substantial and durable. But fix idle compute last, because the architectural migration is significant and should wait until the lower-hanging fruit has been picked and the program is large enough to justify the migration cost. Then there’s complex stuff like speculative decoding, quantization, and hardware-level optimization, but I’d put those at the bottom of this hierarchy since they require infrastructure maturity that most programs do not yet have.
I am a firm believer that agentic ROI can be achieved, but only when executed at scale, with a brutal focus on cost reduction. ROI is there for you to have, but it is fragile. We are proving this at this very moment, but the learning curve is steep.
What is the state of Agentic AI?
The chatbot era is over, everyone agrees on that, and the autonomous enterprise era has supposedly begun, but almost nobody can point to one running in production.
But what I’m seeing is that the first half of 2026 sits in the gap between those two claims.
Every major technology vendor now offers something they call an agent platform. Enterprise software companies have bolted agents onto ERP, CRM, ITSM, HR, and every other acronym that has a budget attached. The market has already moved past the question of whether agentic AI is possible and is now switching to a new one: can you deploy it without burning through your margin, and can you scale it past the single pilot? The answer, for now, is “sometimes, partially, and with more human oversight than anyone planned for”.
I see two narratives running simultaneously and they are almost completely disconnected from each other.
Take the vendor narrative. If you read the platform announcements and the benchmark updates, you get the feeling that autonomous agents are transforming customer service, software, medicine, etc. Every product page now has the word “agent” on it, and some even deserve it.
But what I’m seeing from the deployment side is considerably less photogenic. When you look at what organizations are actually running in production, most so-called agents turn out to be supervised workflow systems that are operating within tightly controlled human boundaries. Yes, mostly Zone II systems. They can automate meaningful work, but they are rarely autonomous in the way that popular media portrays. Capgemini, in its 2025 global survey, said that only 14% of organizations had reached partial or full-scale deployment of AI agents. Another 23% were still running pilots. The majority of business processes were not expected to reach higher levels of autonomy for several more years, and that timeline keeps sliding.
Agent adoption is accelerating, we all see the benefits, and no, we are past the scare of dystopian job loss that Big Tech was frightening us with*. Autonomous business operations, however, are not accelerating. There’s a lot of talk about zero-human companies (AI native) and AI-first enterprises, and as you know, the industry has a financial incentive to pretend that adoption and autonomy are the same thing.
We have proven that agentic AI can generate measurable value. The gains are real and they are showing up in customer service, IT operations, document processing, claims handling, financial workflows, network operations, software engineering, and internal support functions. The deployments that were published talk about productivity improvements between 10% and 30%, process cycle time reductions between 20% and 70%, cost reductions between 10% and 40%, and error reductions between 15% and 60%. Capgemini (in the same report) estimates that agentic AI could generate roughly $450 billion in combined cost savings and revenue uplift by 2028 across surveyed economies.
So yes, the ROI is arriving, albeit fragile and mostly bound by humans in the loop, but the autonomy is not and the people who are selling the autonomy story are collecting the ROI from the supervision story without acknowledging the difference.
One of the most striking features of Q2 2026 is that the market still lacks a shared definition of what an agent actually is. I have seen FAQ chatbots being sold in polished agent terminology and then presented to investors as agentic AI. I have also seen RAG applications with a for-loop called autonomous systems. The industry has four categories of things all being sold under the same label. The first category is rebranded chatbots that retrieve information and answer questions. They are useful, but they are not agents. Then there are workflow agents that are capable of tool calling and multi-step execution, and this is where most current enterprise deployments are. The third category is process agents that execute significant portions of end-to-end business processes like claims processing or loan review and what have you. This is where the serious ROI numbers come from, but there are lots of risks associated with this choice. Then there is the other end of the spectrum, which is autonomous agent teams and multi-agent systems that coordinate specialized agents through orchestration frameworks. This category receives the most media coverage but it is the smallest percentage of production deployments. The ratio between media attention and reality has never been more inverted.
A common misconception is that high ROI requires high autonomy. My own data from 177 deployments across 20 sectors says that the strongest business cases are emerging from narrowly scoped systems operating inside mature, well-understood processes. Organizations that have implemented bounded execution with human-on-the-loop governance and process redesign are consistently outperforming organizations chasing full autonomy. I think that the lesson is remarkably similar to the history of industrial automation.
Factories did not become valuable because they removed every human, but because they automated the parts that mattered most and left the judgment calls to people who understood the consequences.
The most important development of 2026 might be what I have been calling the Agentic Middle Class.
These are systems that reason, plan, use tools, maintain memory, and execute workflows, but remain tightly constrained. They are neither simple assistants nor autonomous executives. Most deployments I am involved with now fall into this category, and the market is slowly discovering that enormous value exists in the space between “chatbot” and “fully autonomous company”.
Yes, my smart friend, the unsexy middle turns out to be where the money is. It usually is.
I also notice that within enterprises, the conversation is shifting away from model intelligence toward economics, and this shift is overdue. Every organization deploying agents is discovering that reasoning costs money, and that planning, tool calling, retry loops, and failed workflows all cost money. And that governance overhead costs the most money of all. And that dynamic tokenomics can account for a 40% cost reduction versus static tokenomics approaches, but that requires a level of architectural discipline that most organizations have not built yet. So I am guessing that the industry is now entering an era of Agentic FinOps, and the winners of the next phase will build the most economically efficient agents.
And they are all focussed on cost-cutting, and creating savings in their programs. If you want to know more about saving money in your agentic program, then read this post about Dutch Tokenomics. Going Dutch has a positive meaning for once.
Read: How to build an agentification factory that actually turns a profit
The maturity curve I am watching in our own programs and observing in the data from others shows me a timeline that is slower than the marketing of Big Tech, but still faster than the skeptics expect, and more concentrated in unsexy operational domains than anyone who has been to a vendor conference would expect.
The factory model works, eventually. The compounding effects of the pattern library are real. The citizen development multiplier is real, if you actually invest in it rather than announcing it at the kickoff meeting and then underfunding the training budget for three consecutive quarters. The path to a positive return runs through narrow, high-frequency, measurable processes, through governance infrastructure that is budgeted from the beginning rather than retrofitted after the first incident and through token budgets that are treated as engineering constraints instead of an accounting surprise and through memory and context architectures that keep context small and meaningful, and through the particular organizational patience that most enterprises are constitutionally bad at.
The most important architectural trend of the moment is the shift toward hybrid intelligence. Organizations are abandoning the assumption that every task needs the most powerful frontier model available. Instead, they are gravitating toward layered architectures, using small local models for routine work, mid-tier models for workflow reasoning, and frontier models for exceptional tasks. This, in my opinion, mirrors how human organizations operate. Not every decision requires a CEO, and not every token requires GPT-5.x. This architectural shift is becoming economically unavoidable given the difference in prices between GPT 4.x and 5.x and Claude’s models.
Agentic AI has proven it can create value, but it cannot create autonomy at scale. The organizations that win this phase figure out how to deploy agents with a differentiated compliance model, and produce the highest business value per token, per dollar, and per unit of human oversight.
*Read: Sam and Dario wants you to think your job is gone so they can borrow another billion | LinkedIn
Signing off,
Marco
Eigenvector builds Agentification factories at scale, for production environments that actually have to pay-off, and Eigenvector Research occasionally publishes papers about why this is harder than the demos suggest.