I’ve built an Evidence Factory because ‘trust me, the agent is fine’ is not a governance strategy


I have now written enough about enterprise agentification to fill a small book, the kind of book that sits on the desk of some AI person and makes them question their career choices. I’ve written about process selection. The 35% automation ceiling. Agentic architecture patterns. Governance philosophy. And the economics of running an AI factory at scale. I have mansplained my way through all of it, and I have watched enough of these programs fail to have earned the right to have opinions.

But I left something out.

I did it because it is the hardest part of running the AI factory. It does not fit neatly into a framework slide, and on top of that, it is the thing that makes finance people nervous, engineering people defensive, and compliance people reach for their blood pressure medication. I left it out because every time I started writing it I ended up with something that sounded either too paranoid or too naive.

What did I leave out? Well, in one sentence it’s rather simple. Once you have selected the right processes, built the right architecture, embedded the right governance philosophy into your agents†, and shipped the whole thing into production, you still need to be able to prove that it is working. Not to yourself and your team, of course, but to the auditor who is sitting across the table from you in eighteen months with a checklist, a mandate, and absolutely no interest in your architectural elegance.

Now, that is what the evidence factory is for.

Before I explain what it is, let me briefly close the loop on everything that came before it, because the evidence factory only makes sense in the context of the program it is designed to instrument. If you have been following this series you can skim the next two sections. If you have not, they are short and I promise not to repeat myself more than necessary, which for me is a significant constraint.

Yes, right into your agents, not only as an afterthought. I believe in neurosymbolic AI: an ontology, implemented through a knowledge graph, that infuses an agent with its governance boundaries in its DNA. I’ve written about it here:

  1. The agentic governance debt crisis | LinkedIn (about embedded governance)
  2. The boring AI that keeps planes in the sky | LinkedIn (about NS AI)

Not everything deserves an agent. You already knew this but you did it anyway.

The foundational insight of the agentification research program I have been running at Eigenvector, in collaboration with my students at InHolland Uni, and built on 177 real deployments across 20 sectors, is that agentic automation consistently hits a ceiling around 35% of process steps. This ceiling exists not because the models aren’t good enough, but because reality is not structured enough for model quality to matter.

The four-zone framework‡ that emerged from that research divides process steps by their structural properties. Zone I is deterministic, structured, low-risk, fully automatable today. Zone II is semi-structured, moderate complexity, automatable with orchestration and guardrails; this is your sweet spot for starting with agentification. But most business cases live in Zone III, which is rich in human activity yet also highly ambiguous, exception-heavy, and judgment-dependent. This is where your generative AI goes to produce nonsense and break down. Zone IV is governance-heavy, compliance-dense, and economically irrational to automate regardless of technical feasibility.

The original sin of enterprise agentification is selecting Zone III and Zone IV processes for automation because they look impressive in a board presentation and a business case. The AI platform vendors will not tell you this. They will tell you that their platform handles complexity, ambiguity, and compliance requirements with robust guardrails and enterprise-grade security. What they mean is that their platform has a human-in-the-loop checkbox and a compliance documentation template. Those are not the same thing.

So yeah, process selection is governance. It is the first governance decision you make in an agentification program, and it determines the difficulty of every governance decision that follows. Get it wrong at the start and no amount of observability infrastructure will save you at the end. The evidence factory I am about to describe is designed for programs that got process selection right. It is not a remediation tool for programs that deployed agents into Zone III.

For more information about this approach, and the underlying papers, read:

  1. Process mining is the strategic foundation your enterprise AI project is missing | LinkedIn
  2. And this one’s got some cool images about the process: The real story behind enterprise scale process agentification | LinkedIn

More rants after the messages:

May 19th, Johannes Keienburg (PhD) and I speak for 30 minutes about my research into process automation: How to Adopt AI Agents Securely at Scale in 2026 | LinkedIn


The architecture works. Until it doesn’t.

Assuming you selected the right processes, the next question is what the agentic architecture looks like and where governance fits inside it. I have written about this at length elsewhere in this series, so I will be concise here in the way that people who have written forty-slide decks on a topic are concise. Hahaha. Sorry.

The success patterns in enterprise agentic deployment share a common structural logic. Single-purpose agents with explicit tool boundaries outperform general-purpose agents with broad permissions. Orchestrator-subagent architectures with clear delegation chains outperform flat multi-agent systems where every agent talks to every other agent and nobody knows who authorized what. And human-in-the-loop gates at Zone II decision points outperform fully autonomous execution for anything touching regulated data or irreversible actions. This is what the data shows across 177 deployments and what the broader research literature confirms.

The governance philosophy I have advocated for throughout this series is embedded DNA governance, in combination with post-factum governance. This is the approach where the governance logic is woven into the agent’s architecture instead of being bolted on as an external compliance layer. The technical mechanism I find most compelling is neuro-symbolic AI: the combination of a neural network for pattern recognition with symbolic reasoning for rule enforcement and logical constraint checking. I call it the “Ritalin pill” for LLMs because that is precisely what it is. The neural component has the intelligence, the creativity, and the occasional tendency to wander off and do something you didn’t ask for; the symbolic component provides the rules and hard constraints that keep the neural component operating within the boundaries we intended.

Neuro-symbolic governance allows us to trace the inferences and get clear audit trails. It also enables the agent to prove that the action is consistent with the policy framework it operates under.

The embedded approach is right, but it is also insufficient on its own.


The straitjacket problem

There is a tension in your AI factory that I have not heard anyone acknowledge to this day. Governance has a real, measurable, per-transaction cost that scales with the complexity and frequency of the controls you implement. Every guardrail adds latency, every audit log consumes storage, and every human escalation consumes someone’s time. To make matters worse, the explainability computation running alongside the main model effectively doubles your inference cost for that transaction, the neuro-symbolic reasoning check adds processing overhead as well, and your infrastructure bill will reflect all of these decisions with complete disregard for your architectural intentions.

After running agents in various configurations with increasing governance constraints, I think I can now put specific numbers on this. A policy engine adding runtime governance adds about ten milliseconds for simple policies and under fifty milliseconds for complex multi-condition policies. Sequential guardrail enforcement adds between three hundred and eight hundred milliseconds of latency. Implementing explainability for high-risk decisions can require running computationally expensive algorithms that effectively double compute resources for every governed transaction. A multi-agent review loop with self-critique can inflate inference costs by three to five times.

On the latency side, a policy engine at ten to fifty milliseconds, plus sequential guardrail enforcement at three hundred to eight hundred milliseconds, gives you a realistic governance latency overhead of three hundred and ten to eight hundred and fifty milliseconds per transaction before you have even touched explainability or multi-agent review.
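The arithmetic above can be sketched in a few lines. This is a minimal illustration using only the article’s own estimates; the function names and parameters are hypothetical, not from any real tool:

```python
# Sketch: compounding the governance overhead figures quoted above.
# All numbers are the article's estimates, not measurements.

def governance_latency_ms(policy_ms: float, guardrails_ms: float) -> float:
    """Per-transaction latency overhead, before explainability or review loops."""
    return policy_ms + guardrails_ms

# Best and worst case from the text: 10-50 ms policy engine,
# 300-800 ms sequential guardrails.
low = governance_latency_ms(10, 300)    # 310 ms
high = governance_latency_ms(50, 800)   # 850 ms

def governed_inference_cost(base_cost: float,
                            explainability: bool = False,
                            review_loop_factor: float = 1.0) -> float:
    """Explainability roughly doubles compute for a governed transaction;
    a multi-agent review loop multiplies inference cost by 3-5x per the text."""
    cost = base_cost * (2.0 if explainability else 1.0)
    return cost * review_loop_factor
```

Stacking everything, `governed_inference_cost(1.0, explainability=True, review_loop_factor=5.0)` gives the ten-times-base-cost worst case that shows up later as the “six to ten times” figure.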

The thing is that all these numbers compound, and they are fatal if you apply them uniformly across every agent action regardless of risk profile.

This is what I call the straitjacket problem.

If you apply governance without discrimination between high-risk and low-risk actions, that will not make your program any safer, but it does make it slower and more expensive than the alternative of not governing at all. This is a sentence I cannot believe I have to write, but here we are. When full governance stacking produces six to ten times your base inference cost, you do not need a crystal ball to explain what happens next. Your program will die from a spreadsheet instead of from some fatal flaw in your architecture. Someone in finance compares the governance overhead to the efficiency gain, finds the ratio has crossed one, and schedules a meeting that has only one slide. That slide has a single number on it, the number is red, and it ends your program. No post-mortem required.

The art of enterprise agentification governance is proportionality. Zone I actions with low blast radius and reversible consequences need only lightweight controls, fast logging, and periodic review. Zone II actions with moderate complexity and regulated data need runtime guardrails and human-on-the-loop monitoring with structured audit trails. The residual Zone III work that still ends up in your program despite your best process selection efforts needs hard human-in-the-loop gates with full reasoning provenance, and documentation that would survive a regulatory examination.

Calibrating that proportionality requires knowing, in real time, what your agents are actually doing, how often they are operating near the boundaries of their policy constraints, where drift is accumulating, and what the governance cost per transaction looks like across your portfolio.

You need, in short, a systematic way to generate evidence about your agentic program that is rigorous enough to satisfy external scrutiny and economical enough not to consume the value it is designed to protect.

That systematic approach is what I call an evidence factory.

And yes, I could’ve said that in the intro, but a teacher will always be a long-winded teacher I guess.


So we invented the evidence factory

Don’t confuse the evidence factory with an observability platform. Nor is it a compliance dashboard or a governance framework, and unfortunately it isn’t a product you can buy from a vendor, although several vendors will happily sell you components of it while giving you the impression they’re selling you the whole thing.

The evidence factory is a true operational capability.

It is the systematic, economically calibrated production of proof that your agentic program is operating within its intended boundaries, that deviations are detected and addressed before they become incidents, that the audit trail is complete enough to reconstruct any agent decision from first principles, and that the cost of generating all of that proof stays proportional to the risk of the processes being governed.

And without it, you are not running a governed agentic program, no matter how much fancy schmancy observability, traceability, and other expensive ‘ability’ tooling you have bought.

The name comes from the manufacturing analogy that I find most useful for thinking about enterprise agentification at scale. A factory produces outputs systematically, at defined quality levels and with documented processes, measurable yields, and clear accountability for defects. And an evidence factory applies that logic to the governance layer of an agentic program. It produces governance evidence at the quality level required by the risk profile of each process, with documented instrumentation, measurable coverage, and clear accountability for gaps.

The insight that led me to this concept was a simple and slightly embarrassing one.

I was reviewing the governance architecture of a large-scale agentic deployment and realized that we had instrumented the system extensively. We had logs, traces, metrics, and dashboards. But we could not answer why the agent did a certain task, what policy framework it was operating under at that moment, what alternatives it had considered and rejected, or what would have happened if the input had been slightly different.

We had observation without explanation, and data without evidence. The evidence factory is the operational design we created to close that gap.


What the evidence factory actually does

The evidence factory produces five categories of evidence. Each corresponds to a different dimension of governance risk and a different audience for the proof it generates.

The first is called reasoning provenance. Every agent action in a governed agentic program should be traceable to an intent, an observation, and an inference. The Agent Execution Record is the data structure that captures these three elements as first-class queryable fields alongside the standard action log. Intent is what the agent was trying to achieve at that decision point. Observation is what it perceived about the environment that was relevant to the decision. Inference is the logical chain that connected the observation to the action. Without reasoning provenance, you’re left with a log and not an explanation. An incident investigation that used to take three days can now be done in three hours.
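A minimal sketch of what an Agent Execution Record could look like, with intent, observation, and inference as first-class queryable fields. The field names here are illustrative assumptions, not a published schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class AgentExecutionRecord:
    """One agent action with its reasoning provenance captured as
    first-class fields alongside the standard action log.
    Field names are illustrative, not a published schema."""
    agent_id: str
    action: str                      # what the agent actually did
    intent: str                      # what it was trying to achieve
    observation: str                 # what it perceived that was relevant
    inference: str                   # the chain linking observation to action
    policy_context: str              # policy framework in force at decision time
    alternatives_rejected: tuple = ()
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def provenance_complete(rec: AgentExecutionRecord) -> bool:
    """An incident investigation needs all three reasoning fields populated;
    a record without them is a log entry, not an explanation."""
    return all((rec.intent, rec.observation, rec.inference))
```

Making the three reasoning fields queryable (rather than burying them in free-text logs) is what turns a three-day incident reconstruction into a three-hour query.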

Then there’s behavioral drift detection. Nearly 90% of all tested agents show measurable drift from their original goals after approximately 30 steps of operation. Microsoft researchers came to the same conclusion last week and published it in a paper on arXiv (link in comments).

Drift isn’t a model failure, by the way. It is a governance architecture failure. The agent was not contained within the boundaries that would’ve prevented it from optimizing toward sub-goals that were never part of its original mandate. Behavioral drift detection in the evidence factory works by establishing behavioral baselines during the initial deployment period and then running continuous statistical tests against production behavior to identify deviations before they accumulate into incidents. The technical implementation uses the same observability infrastructure that captures reasoning provenance, with an additional analytics layer that runs comparison functions against the baseline.
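A toy version of that baseline comparison might look like the following. The z-test here is a stand-in for the richer statistical tests a production analytics layer would run per metric; the threshold is an assumption:

```python
import statistics

def drift_alert(baseline: list[float], window: list[float],
                z_threshold: float = 3.0) -> bool:
    """Flag drift when a recent window's mean behavioral metric deviates from
    the deployment-time baseline by more than z_threshold standard errors.
    A real evidence factory would run richer tests (KS, PSI) per metric."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    if sigma == 0:
        return statistics.mean(window) != mu
    standard_error = sigma / (len(window) ** 0.5)
    z = abs(statistics.mean(window) - mu) / standard_error
    return z > z_threshold
```

The point is architectural, not statistical: the comparison runs continuously against production telemetry, so deviations surface before they accumulate into incidents rather than in a post-mortem.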

We also need something I call trust scoring‡ and risk-adjusted oversight calibration. Not every agent action requires the same level of scrutiny. Trust scoring assigns a dynamic confidence level to each agent based on its historical accuracy, policy compliance record, behavioral stability, and the blast radius of the actions it is currently executing. High trust scores on low-risk actions trigger lightweight controls. Low trust scores on high-risk actions trigger immediate human escalation regardless of the agent’s own assessment of the situation. This is the mechanism that actually solves the straitjacket problem. It applies your governance proportionally by concentrating oversight where the evidence suggests it is needed and relaxing it where the track record justifies confidence.
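A hedged sketch of how trust scores could map to oversight tiers. The thresholds below are invented for illustration; in practice you would calibrate them from your own telemetry:

```python
def oversight_level(trust_score: float, blast_radius: str) -> str:
    """Map a dynamic trust score (0.0-1.0) and an action's blast radius
    to an oversight tier. Thresholds are illustrative assumptions."""
    if blast_radius == "high":
        # Low trust on high-risk actions escalates to a human,
        # regardless of the agent's own assessment of the situation.
        return "human_in_the_loop" if trust_score < 0.9 else "human_on_the_loop"
    if blast_radius == "medium":
        return "human_on_the_loop" if trust_score < 0.7 else "lightweight"
    # Low blast radius: lightweight controls unless the track record is poor.
    return "lightweight" if trust_score >= 0.5 else "human_on_the_loop"
```

This is the proportionality mechanism in miniature: oversight concentrates where the evidence says it is needed and relaxes where the track record justifies confidence.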

The fourth category is called policy boundary mapping. Every agent in a governed program should have an explicit, documented map of the policy boundaries within which it operates. Policy boundary mapping in the evidence factory maintains a live record of each agent’s authorized tool set, data access permissions, escalation thresholds, and decision authority limits. It tracks proximity to those boundaries in real time and generates alerts when an agent is operating consistently near the edge of its authority. This is the behavioral signature of an agent that is about to exceed its mandate or has already found ways to achieve its goals through indirect tool chaining.
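Proximity tracking can start as simply as comparing live usage counters against authorized limits. The counters and the 80% warning ratio in this sketch are hypothetical:

```python
def boundary_proximity_alert(usage: dict, limits: dict,
                             warn_ratio: float = 0.8) -> list:
    """Return the policy boundaries an agent is operating near
    (at or above warn_ratio of its authorized limit). Persistent alerts
    are the behavioral signature of an agent about to exceed its mandate."""
    return [name for name, used in usage.items()
            if name in limits and limits[name] > 0
            and used / limits[name] >= warn_ratio]
```

The interesting signal is not a single alert but consistency: an agent that lives at the edge of its authority is either under-provisioned or quietly chaining tools toward goals it was never given.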

The last category we’ve built is economic governance telemetry. This is what keeps the evidence factory from becoming the straitjacket problem it was designed to solve. Economic governance tracks the cost of governance per transaction, per agent, and per process, and compares that cost against the efficiency value being generated by the automation it governs. When governance costs approach or exceed efficiency gains, the telemetry triggers a governance architecture review rather than allowing the program to continue consuming resources on controls that are no longer economically justified. This is the feedback loop that makes the evidence factory self-correcting rather than self-perpetuating.
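The feedback loop can be expressed in a few lines. The 0.8 review threshold below is my assumption, not a standard; the point is that the trigger fires as costs approach the value, not after they exceed it:

```python
def governance_review_needed(governance_cost: float,
                             efficiency_value: float,
                             review_threshold: float = 0.8) -> bool:
    """Trigger a governance architecture review once governance cost
    approaches the efficiency value it protects. The 0.8 threshold is an
    illustrative assumption; a ratio >= 1.0 means governance costs more
    than the automation saves."""
    if efficiency_value <= 0:
        return True  # no measurable value: review unconditionally
    return governance_cost / efficiency_value >= review_threshold
```

This is what makes the factory self-correcting: the same telemetry that proves compliance also proves (or disproves) that the compliance machinery still pays for itself.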

If you’re interested in learning more about the evidence factory, download the whitepaper in the comments.

Thanks to Olivier Rikken for his work on Trust scoring.

† Based on the research “Tokenomics for agentic AI and quality-per-token-metrics”, “Roundtrip value governance” and “Patternomics: a formal theory of execution pattern optimization in enterprise agentic AI systems”, all downloadable at Eigenvector/research.


How you build one without going broke

Yeah, cost. I have yet to see an organization running an AI program that has factored in the cost of governance. Proportionally, I mean, and managed. That’s why the evidence factory has a five-layer cost stack that maps directly to the governance functions it performs, and understanding that stack is the precondition for building the factory with economy in mind.

You start with orchestration cost: the infrastructure you need to coordinate evidence collection across your agent fleet. This is largely a fixed cost that scales slowly with the number of agents but quickly with the complexity of the multi-agent interactions you need to instrument. That is also why you shouldn’t start with Zone III. The architectural decision with the biggest impact on orchestration cost is whether you instrument agents individually or deploy a governance sidecar at the infrastructure level that intercepts agent actions without requiring changes to agent code. The sidecar approach, exemplified by tools like Hoop.dev’s AI Governance Sidecar Injection, is consistently more economical at scale because it separates governance instrumentation from agent development and lets governance architecture evolve independently of the agent fleet.

Then you add perception cost: the expense of collecting the raw telemetry the evidence factory processes. This is where OpenTelemetry earns its place in the stack. The OpenTelemetry standard for generative AI operations provides vendor-neutral instrumentation and semantic conventions for LLM spans, which prevents vendor lock-in and keeps the telemetry data your evidence factory depends on portable across backends. The GenAI Special Interest Group’s work on common semantic conventions for agents, models, and vector databases is the most important standardization effort in enterprise AI infrastructure right now, and it is happening mostly in public with minimal fanfare, which is exactly the kind of thing that turns out to matter enormously in three years.

Then there’s reasoning cost: the computational expense of the neuro-symbolic governance layer (if you choose to add it) and the explainability functions that produce the reasoning provenance records. This is the most expensive layer on a per-transaction basis and the one that most directly benefits from risk-adjusted calibration. Running full symbolic reasoning verification on every agent action is economically indefensible. Running it on actions that trigger trust score thresholds, operate near policy boundaries, or involve regulated data is both economically justifiable and architecturally sound.

We also have memory cost. Yes, that counts as well, though to a lesser degree. I’m talking about the storage and retrieval infrastructure for the evidence the factory produces. Agent Execution Records, behavioral baselines, trust score histories, policy boundary maps, and economic governance telemetry all need to be stored in formats that support the query patterns an audit or incident investigation will require. The architectural recommendation here is immutable append-only storage for the core evidence records, which prevents after-the-fact modification and satisfies the audit integrity requirements of regulated industries, combined with a queryable analytics layer that makes the evidence accessible without requiring direct access to the immutable store.
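A minimal illustration of the append-only idea: each record carries a hash of its predecessor, so any after-the-fact modification is detectable. A production store would persist this and put the queryable analytics layer on top; this sketch is in-memory only:

```python
import hashlib
import json

class EvidenceLog:
    """Toy append-only evidence store with a hash chain.
    Tampering with any stored record breaks verification from that
    point onward, which is the property audit integrity requires."""

    def __init__(self):
        self._chain = []

    def append(self, record: dict) -> str:
        prev = self._chain[-1]["hash"] if self._chain else "genesis"
        payload = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((prev + payload).encode()).hexdigest()
        self._chain.append({"record": record, "prev": prev, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute every link; any mutation of a past record fails."""
        prev = "genesis"
        for entry in self._chain:
            payload = json.dumps(entry["record"], sort_keys=True)
            expected = hashlib.sha256((prev + payload).encode()).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True
```

Separating this write store from the query layer is exactly the event-sourcing pattern the Kleppmann book mentioned below explains in depth.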

By the way, if all of this sounds like abracadabra to you, which I can imagine, start with Martin Kleppmann’s book “Designing Data-Intensive Applications”; that will make it all click. It is the clearest explanation I have found of why immutable append-only storage exists, what event sourcing and log-based architectures actually do, and why separating your write store from your query layer is an architectural decision.

The last cost item you want to measure is output cost. It is the easiest to measure but the most expensive of them all, because it is the human time required to review escalations and act on the evidence the factory produces. Most organizations underestimate this layer because it does not appear in the infrastructure budget.

A healthy evidence factory running a well-governed agentic program should produce human escalations for between two and five percent of agent actions in Zone II processes. Above five percent suggests the trust scoring thresholds are too conservative or the process selection included too much Zone III work. Below two percent suggests the thresholds are too permissive or the agents have found ways to operate that the monitoring is not capturing.

Both are governance failures of different types.
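The two-to-five-percent band translates directly into a health check. A sketch, with the band values taken from the text and the labels my own shorthand:

```python
def escalation_health(escalated: int, total: int,
                      low: float = 0.02, high: float = 0.05) -> str:
    """Classify the Zone II human-escalation rate against the 2-5% band.
    Above the band: trust thresholds too conservative, or too much
    Zone III work slipped into process selection. Below it: thresholds
    too permissive, or agents operating in ways monitoring misses."""
    if total == 0:
        return "no_data"
    rate = escalated / total
    if rate > high:
        return "too_conservative"
    if rate < low:
        return "too_permissive"
    return "healthy"
```

The asymmetry is worth noticing: a suspiciously quiet escalation queue is just as much a governance failure as a flooded one, only harder to spot.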


Enterprise toys vs open source reality

Alas, my smart friend. No single vendor does all of this. I want to be direct about that, because the enterprise software market has a strong commercial incentive to imply otherwise, and that implication is costing organizations significant money in platform purchases that cover forty percent of what they need and require custom development for the rest.

Yeah, that’s the ugly truth you’re faced with when you want to be an early adopter.

The commercial vendor landscape for evidence factory components is genuinely impressive in its breadth and genuinely frustrating in its gaps. LangSmith, LangChain’s managed observability platform, processes over a billion events per day and is used by approximately 35% of the Fortune 500. It provides high-fidelity execution trees, annotation queues, and LLM-as-a-judge evaluators that cover the reasoning provenance and behavioral drift detection layers well. Arize AX and its open source counterpart Phoenix use OpenInference (an OpenTelemetry-based standard) for agent graph visualization and real-time cross-trace analysis. Galileo’s Luna-2 evaluation model delivers 95% F1 accuracy on hallucination detection at a cost that makes production-scale evaluation economically viable for the first time. Zenity provides automated agent discovery, which matters enormously given that 60% of AI activity in most organizations is shadow AI operating outside any centralized visibility, and maps agent risk dependencies through what they call the Zenity Attack Graph.

If this language frightens you, it may be time to hire what I call a “governance engineer”: a crossover between an engineer or software architect and someone with a lust for governance. If you know one, or are one, do contact me ;). Now skip this chapter and go straight to the next one, about metrics.

On the policy enforcement side, Open Policy Agent remains the most mature deterministic policy engine available. It converts business rules into executable Rego logic with sub-ten-millisecond evaluation latency for simple policies. Cedar (AWS’s open source policy language) provides formal verification of policy safety properties that OPA does not offer natively. NVIDIA’s NeMo Guardrails is also interesting because it combines deterministic pattern-based rules with LLM-based semantic checks for the neuro-symbolic layer, using a declarative language called Colang that makes governance rules readable by people who are not engineers, which turns out to matter significantly when your compliance team needs to verify that the rules reflect the policy they intended.

For the memory and traceability layer, I prefer Langfuse because it provides open source observability with prompt management and evaluation capabilities that are particularly strong for organizations with data sovereignty requirements (Europe) that make cloud-hosted platforms difficult to justify. MLflow, with its Unity Catalog integration, handles model lineage and versioning in ways that extend naturally to the agent governance context. Temporal provides durable workflow execution with immutable event histories that satisfy the audit integrity requirements of regulated industries without requiring custom storage architecture.

And I’m sorry to say this, but the only practical recommendation I can give you is a composable stack rather than a platform purchase. OpenTelemetry for telemetry standardization. OPA or Cedar for policy enforcement. NeMo Guardrails or a Constitutional AI implementation for the neuro-symbolic layer (again, if you want this). LangSmith or Langfuse for observability, depending on your data sovereignty requirements. Temporal for workflow durability and audit trail integrity. And a custom economic governance telemetry layer built on top of your existing FinOps infrastructure, because no vendor has built this well yet and the ones who claim to have built it are usually selling you a dashboard that shows you costs without connecting them to governance value.

The integration complexity is real, but the alternative, which is buying a single platform that covers everything inadequately, is more expensive and produces worse governance outcomes.

The organizations running the most effective evidence factories made the architectural decision to compose rather than consolidate, accepted the integration investment as the price of doing governance correctly, and built the economic governance telemetry layer themselves because nobody else was going to do it for them.


The metrics that tell you whether any of this is working

An evidence factory that does not produce actionable metrics is nothing more than an expensive logging system with a good story behind it. The metrics that matter are the ones that connect governance activity to governance outcomes and governance costs to governance value.

Cost per compliant decision is the primary economic metric of the evidence factory. It measures the total governance overhead (instrumentation, computation, storage, and human review) divided by the number of agent decisions that completed within their policy boundaries without requiring escalation or remediation. Tracking this metric over time tells you whether your governance architecture becomes more efficient as your agent fleet matures, or whether governance costs are scaling faster than the efficiency gains that justify them. Yeah, quite the important metric, this one.

The target trajectory is a fifteen to thirty percent reduction in cost per compliant decision over two to three quarters as trust scores stabilize and oversight calibration becomes more precise.
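Both the metric and the target trajectory are straightforward to compute once the telemetry exists. A sketch with illustrative inputs; the cost categories mirror the ones named above:

```python
def cost_per_compliant_decision(instrumentation: float, computation: float,
                                storage: float, human_review: float,
                                compliant_decisions: int) -> float:
    """Total governance overhead divided by the number of agent decisions
    that completed within policy without escalation or remediation."""
    if compliant_decisions == 0:
        return float("inf")  # all governance cost, no compliant output
    total_overhead = instrumentation + computation + storage + human_review
    return total_overhead / compliant_decisions

def on_target_trajectory(quarter_start: float, quarter_now: float) -> bool:
    """The text's target: at least a 15% reduction (aiming for 15-30%)
    over two to three quarters as trust scores stabilize."""
    return quarter_now <= quarter_start * 0.85
```

Note that human review typically dominates the numerator, which is why escalation-rate calibration moves this metric more than any infrastructure optimization does.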

Then there’s the “agent sprawl index”, which measures the ratio of ungoverned agents to total agents in production. In most enterprises this number sits somewhere between embarrassing and alarming.

Gravitee’s 2026 survey found that only 24.4% of organizations have full visibility into which agents are communicating with each other. More than half of all agents run without security oversight or logging.

Now, the agent sprawl index gives you a single number that quantifies the governance coverage gap and tracks progress toward closing it. Organizations at governance maturity level four or five show sprawl indices 94% lower than organizations at level one. That is the measurable outcome of having an evidence factory versus not having one.

Mean time to detect a governance violation measures how quickly your evidence factory identifies an agent that is operating outside its policy boundaries. The research benchmark for production-grade agentic governance is detection within the same operational session in which the violation occurs, meaning before the agent has completed the task that triggered it. Most organizations currently detect governance violations in post-mortem analysis, which is basically the same thing as installing a smoke detector after the fire.

Human escalation rate measures the percentage of agent actions that require human review or intervention. The target range of two to five percent for Zone II processes is the operational definition of governance proportionality. Above that range your governance architecture is consuming human attention on decisions that your trust scoring should be handling automatically. Below that range your thresholds are too permissive or your monitoring is missing things. Both failure modes are visible in the escalation rate before they are visible anywhere else.

The one number that tells your CFO everything in under thirty seconds is the governance-to-value ratio: total governance cost divided by total efficiency value generated by the governed processes. When this ratio exceeds one, governance costs more than the automation saves, which means your agentification program is a net cost to the organization regardless of what the transformation deck says about strategic value.

Tracking this ratio is the evidence factory’s most important economic function and the one most likely to generate an interesting conversation with your finance director.


The evidence factory is not a product. Yet

No vendor sells the evidence factory as a complete, integrated, production-ready system. Several vendors sell components of it, and some of those components are excellent. But closing the gap still requires the same thing: someone who knows what they are building, has decided to build it properly, and is willing to do the integration work the vendor ecosystem has not yet done for them.

That gap will close, of course. Frontier labs have focused too much on scoring benchmarks and too little on the needs of their enterprise customers, but the market is picking this up as we speak. The industry is now moving toward it with the urgency of an industry that has watched enough production databases get deleted to understand that governance infrastructure is not optional overhead. The MLOps market heading toward 20 billion dollars by 2032 is largely a bet on that closing gap. The acquisition wave that absorbed Protect AI into Palo Alto Networks, CalypsoAI into F5, Lakera into Check Point, and Truera into Snowflake is the consolidation pattern of a market assembling itself into integrated platforms. In one or two years the evidence factory will probably be something you can buy with an enterprise license and a professional services engagement.

Right now it is something you build. With composable open source components, targeted commercial platform purchases for the layers where commercial solutions are genuinely better than the open source alternatives, custom economic governance telemetry because no one has built that well yet, and the architectural discipline to keep the governance cost below the governance value at every layer of the stack.

The agentification programs that will survive the next two years of audit pressure and board-level accountability questions are the ones that can open a dashboard and show the evidence. The reasoning provenance records showing why each agent made each decision. The behavioral drift reports showing that the agents are operating within their intended boundaries. The economic governance telemetry showing that the cost of governing the program is proportional to the value it generates.

That is what the evidence factory produces. That is what “trust me, the agent is fine” does not.

I generated forty-four slides on this subject now. Some of them are even correct.

Signing off,

Marco

Eigenvector builds agentification factories at scale, for production environments that actually have to pay off, and Eigenvector Research occasionally publishes papers about why this is harder than the demos suggest.

👉 Think a friend would enjoy this too? Share the newsletter and let them join the conversation. LinkedIn, Google, and the AI engines appreciate your likes by making my articles available to more readers.
