Hallucination comes from code, not the model: how constraint engineering accelerates time-to-market with agents

In teams using autonomous agents, the biggest hallucination driver isn't the LLM. It's the absence of contracts, gates, and traceability in code.


It’s not the model’s fault

Every time an autonomous agent invents a state that doesn’t exist, completes a task without evidence, or overwrites another agent’s work, the instinctive reaction is the same: “the model hallucinated.” And yes, language models do hallucinate. That is an intrinsic statistical property of any system that generates text token by token from probability distributions.

But here’s the thesis I defend after months of building multi-agent systems in production: in development teams using agents, the biggest driver of operational hallucination is not the LLM itself. It’s the absence of constraint engineering in the code and in the process. When engineering creates contracts, gates, traceable context, and concurrency control, the operational hallucination rate drops dramatically and time-to-market shrinks.

This article will prove it with real code, practical theorems, and a methodology you can apply tomorrow on your team.

What “hallucination” means in this context

First, I need to define what I’m calling hallucination. I’m not talking about grammatical errors or the model confusing a date. I’m talking about systemic failure that occurs when an autonomous agent operates within a development pipeline. Specifically, five types of failure:

  1. Inventing project state. The agent claims a task is in progress when it hasn’t even started. Or that a file exists when it was deleted three commits ago. This happens when there’s no queryable source of truth and the agent fills gaps with inference.

  2. Completing a task without real criteria. The agent marks something as “done” without any test passing, no acceptance criteria verified, no evidence produced. It’s the equivalent of a developer merging a PR without review.

  3. Using irrelevant context. The agent receives a huge, unfiltered context and ends up using snippets that have no relation to the task. The response looks coherent on the surface but is grounded in wrong information.

  4. Overwriting another agent’s work. In multi-agent scenarios, two agents edit the same file or modify the same state without coordination. Both are “correct” locally, but the global result is inconsistent.

  5. Moving something to “done” without minimum quality. The task is moved to its final status, but the code doesn’t compile, tests don’t pass, or the functionality doesn’t meet the original requirement.

Notice the pattern: none of these failures are exclusively the model’s fault. They are all system failures that occur because the code surrounding the model doesn’t impose sufficient constraints. The model does what models do (generate the most likely continuation). The problem is that nobody defined what “likely” should mean in that context.

Practical proof: the mcp-graph codebase

To make this discussion concrete, I’ll use the codebase of mcp-graph, an open-source tool that implements task graphs with execution pipelines for autonomous agents via MCP. The implementation was designed specifically to reduce hallucination through engineering. Each mechanism below is evidence that the system forces the agent to operate within verifiable boundaries.

| Code evidence | What it enforces | Effect on hallucination |
| --- | --- | --- |
| start_task assembles next + context + RAG + TDD + status (src/mcp/tools/start-task.ts, src/core/pipeline/start-task.ts) | Standardized execution input | Less context improvisation |
| finish_task runs DoD + blocks done if required checks fail (src/mcp/tools/finish-task.ts, src/core/pipeline/finish-task.ts) | Controlled output | Less “false done” |
| DoD with 9 checks and score/grade (src/core/implementer/definition-of-done.ts) | Objective completion criteria | Less conclusion without evidence |
| AC validation via INVEST + testable/measurable parser (src/core/analyzer/ac-validator.ts, src/core/analyzer/ac-parser.ts) | Requirements become testable contracts | Less ambiguity |
| RAG context with cache, dedup, rerank, citations (src/mcp/tools/context.ts, src/core/rag/post-retrieval.ts, src/core/rag/citation-mapper.ts) | Traceable evidence and token budget | Less “textual guessing” |
| Unified gate blocking calls with phase/prerequisite errors (src/mcp/unified-gate.ts) | Mandatory flow | Less process deviation |
| Lock by lease + explicit conflict (src/core/store/lock-manager.ts) | Mutual exclusion between agents | Less state conflict |
| Optimistic locking by version (src/core/store/sqlite-store.ts) | Detects writes to stale state | Less silent overwrite |
| Harness with 7 dimensions and hallucination risk alert (src/core/harness/harnessability-score.ts, src/core/harness/harness-preflight.ts) | Measurable structural quality | Less recurring systemic error |

Each row in this table is a maze wall. And each wall exists because, without it, the agent found a shortcut that produced incorrect results.

The input and output pipeline

The most important mechanism is the start_task / finish_task pair. When the agent calls start_task, the system doesn’t just deliver the task text. It assembles a complete package: the next task in the queue (respecting dependencies), relevant context filtered by RAG, parsed acceptance criteria, current graph state, and expected tests. The agent receives everything it needs to execute, and nothing that could confuse it.

On the output side, finish_task doesn’t accept a “done” without verification. The system runs the Definition of Done (DoD), which includes 9 mandatory checks: test existence, test passing, minimum coverage, acceptance criteria compliance, code quality, absence of regressions, among others. If the required checks don’t pass, the task does not advance. The agent is forced to fix issues before proceeding.

This eliminates completion hallucination. The agent may “think” it’s finished, but the system only confirms with evidence.
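The shape of such an output gate can be sketched in a few lines. This is an illustrative sketch, not the actual mcp-graph API: names like DodCheck, runDod, and finishTask are assumptions made for the example.

```typescript
// Hypothetical sketch of a Definition-of-Done gate; names are illustrative,
// not the actual mcp-graph implementation.
type DodCheck = {
  name: string;
  required: boolean;
  run: () => boolean; // in practice: execute tests, measure coverage, etc.
};

type DodResult = { passed: boolean; failures: string[] };

// Runs every check and collects the names of required checks that failed.
function runDod(checks: DodCheck[]): DodResult {
  const failures = checks
    .filter((c) => c.required && !c.run())
    .map((c) => c.name);
  return { passed: failures.length === 0, failures };
}

// finishTask refuses to advance the status unless every required check passes.
function finishTask(taskId: string, checks: DodCheck[]): string {
  const dod = runDod(checks);
  if (!dod.passed) {
    throw new Error(`Task ${taskId} blocked by DoD: ${dod.failures.join(", ")}`);
  }
  return "done";
}
```

The key design choice is that the gate runs the checks itself instead of trusting the agent's claim that they passed.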

Context with mandatory citation

The mcp-graph RAG module isn’t naive retrieval that dumps text into the prompt. It implements caching to avoid reprocessing, deduplication to remove redundancy, reranking to prioritize relevance, and crucially, citation mapping. Each context snippet used by the agent must point to its original source.

This matters because it transforms the model from “text generator” into “evidence synthesizer.” When the agent needs to cite the source of each claim, the space for invention drops dramatically. It’s like the difference between writing an opinion essay and writing an academic paper with references.
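The post-retrieval steps can be sketched as a single pure function. This is a minimal illustration, not the actual mcp-graph RAG module: the Snippet type and buildContext name are assumptions for the example.

```typescript
// Hypothetical sketch of citation-mapped context assembly; types and names
// are illustrative, not the actual mcp-graph RAG pipeline.
type Snippet = { text: string; source: string; score: number };

function buildContext(snippets: Snippet[], maxSnippets: number): string[] {
  // Deduplication: keep the first occurrence of each snippet text.
  const seen = new Set<string>();
  const unique: Snippet[] = [];
  for (const s of snippets) {
    if (!seen.has(s.text)) {
      seen.add(s.text);
      unique.push(s);
    }
  }
  return unique
    .sort((a, b) => b.score - a.score)            // rerank by relevance
    .slice(0, maxSnippets)                        // enforce the token budget
    .map((s) => `${s.text} [source: ${s.source}]`); // mandatory citation
}
```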

Safe concurrency

In multi-agent scenarios, the lock-manager implements mutual exclusion by lease: when an agent is working on a task, other agents cannot modify the same resource. If the lease expires (because the agent crashed or was interrupted), another agent can take over. And optimistic locking by version in SQLite ensures that, even in race condition scenarios, a write to stale state is detected and explicitly rejected, rather than silently overwriting.
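A lease-based lock can be sketched in a few lines. This is an illustrative sketch under assumed names (LockManager, acquire), not the actual mcp-graph lock-manager:

```typescript
// Hypothetical lease-based lock sketch; illustrative, not the actual
// mcp-graph lock-manager.
type Lease = { agent: string; expiresAt: number };

class LockManager {
  private leases = new Map<string, Lease>();

  acquire(resource: string, agent: string, ttlMs: number, now: number): boolean {
    const lease = this.leases.get(resource);
    // A live lease held by another agent means explicit rejection,
    // never a silent overwrite.
    if (lease && lease.expiresAt > now && lease.agent !== agent) return false;
    // Free, expired, or re-acquired by the same agent: grant a fresh lease.
    this.leases.set(resource, { agent, expiresAt: now + ttlMs });
    return true;
  }
}
```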

Without these mechanisms, two agents can simultaneously be “correct” in their local views of state while producing a globally inconsistent result. That’s hallucination by system design, not by model failure.

Four practical theorems

Based on the evidence above, I propose four theorems that formalize the relationship between constraint engineering and operational hallucination. I call them “theorems” not in the formal mathematical sense, but as propositions that emerge from practice and that you can verify empirically.

Theorem 1: Constraint Density

Statement: the higher the density of executable constraints (schema + gate + DoD), the lower the operational hallucination.

Intuition: the agent can “imagine” anything, but it cannot persist imagination when the system requires passing through required checks. Each constraint is a filter that separates valid output from hallucinated output. A single constraint can be circumvented; a dense network of constraints creates a force field that converges the agent toward correct behavior.

Practical example: in mcp-graph, even if the model generates an invalid status, the unified-gate blocks the transition. Even if the model claims tests passed, the DoD actually runs the tests. The hallucination may be generated, but it never persists.
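A transition gate of this kind is small enough to sketch directly. The status names and the gate function below are illustrative, not the actual unified-gate code:

```typescript
// Hypothetical unified-gate sketch; statuses and transitions are
// illustrative, not the actual mcp-graph implementation.
type Status = "todo" | "in_progress" | "review" | "done";

// Only these transitions are valid; everything else is blocked at runtime,
// regardless of what the model generated.
const TRANSITIONS: Record<Status, Status[]> = {
  todo: ["in_progress"],
  in_progress: ["review"],
  review: ["done", "in_progress"],
  done: [],
};

type GateResult = { allowed: boolean; reason?: string };

function gate(current: Status, proposed: Status): GateResult {
  if (TRANSITIONS[current].includes(proposed)) return { allowed: true };
  return {
    allowed: false,
    reason: `invalid transition ${current} -> ${proposed}`,
  };
}
```

Note that the hallucinated transition is still generated by the model; the gate simply refuses to persist it.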

Theorem 2: Evidence with Citation

Statement: context with multi-strategy retrieval + post-processing + citation reduces factual error more than raw context.

Intuition: when every piece of response needs to point to a verifiable source, the space for invention shrinks. The agent becomes “bound” to available evidence. It’s the difference between giving someone an entire library without an index (raw context) and giving a curated set of references with page numbers (context with citation).

Practical example: the mcp-graph RAG pipeline does retrieval, dedup, rerank, and citation mapping. The agent doesn’t receive “everything that might be relevant.” It receives what is relevant, without duplication, ordered by importance, with mandatory citation. The “textual guessing” rate drops because the model has less room to invent.

Theorem 3: Flow Integrity

Statement: if status only advances through valid transitions and completion requires DoD, the rework rate drops.

Intuition: blocking “shortcuts to done” reduces hidden technical debt and accelerates net delivery. It seems counterintuitive (more gates = slower?), but in practice the opposite happens. Without gates, tasks are closed prematurely, bugs appear later, rework consumes twice the time. With gates, tasks take slightly longer to close, but they actually close.

Practical example: the mcp-graph unified-gate prevents an agent from calling finish_task if the task hasn’t passed phase prerequisites. This eliminates the scenario where a “ready” task actually needs more work, saving entire cycles of reopening and fixing.

Theorem 4: Safe Concurrency

Statement: resource locking + optimistic versioning reduce inconsistency between parallel agents.

Intuition: without concurrency control, two agents can be “correct” locally (each sees a consistent state) and wrong globally (the combined result is inconsistent). Lease-based locking ensures mutual exclusion. Optimistic versioning ensures conflict detection. Together, they eliminate the entire class of hallucinations caused by partial state visibility.

Practical example: in mcp-graph, if Agent A is editing Task 5 and Agent B tries to edit the same task, the lock-manager explicitly rejects the second operation. If two agents read the same version and one writes first, the second receives a version conflict error instead of silently overwriting.
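The optimistic-versioning half can be sketched as a tiny in-memory store. This is an assumption-laden illustration, not the actual sqlite-store code:

```typescript
// Hypothetical optimistic-versioning sketch; illustrative, not the actual
// mcp-graph sqlite-store.
type Row = { value: string; version: number };

class VersionedStore {
  private rows = new Map<string, Row>();

  read(id: string): Row | undefined {
    return this.rows.get(id);
  }

  // The write succeeds only if the caller read the current version;
  // a stale write is rejected instead of silently overwriting.
  write(id: string, value: string, expectedVersion: number): boolean {
    const row = this.rows.get(id) ?? { value: "", version: 0 };
    if (row.version !== expectedVersion) return false; // version conflict
    this.rows.set(id, { value, version: row.version + 1 });
    return true;
  }
}
```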

The M.A.P.A. methodology: a framework for reducing hallucination

Based on the theorems above, I propose the M.A.P.A. methodology, an acronym for four practices that, combined, create the containment infrastructure that transforms agents from “text generators” into reliable executors.

M: Model contracts

Define strict schemas for everything the agent manipulates: nodes, edges, status, transitions. Require measurable and testable acceptance criteria. Don’t let the agent decide what is “sufficient”; codify that in validations.

In practice, this means using schema validation (Zod, JSON Schema, or equivalent) for every agent input and output. If the agent needs to create a task, the schema defines which fields are required, which values are valid, and which state transitions are allowed.
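To keep the sketch dependency-free, here is a minimal hand-rolled validator instead of a full schema library; in a real system you would reach for Zod or JSON Schema, and all field names below are illustrative:

```typescript
// Hypothetical input contract for task creation; field names are
// illustrative, and a real system would use Zod or JSON Schema instead.
const VALID_STATUSES = ["todo", "in_progress", "review", "done"] as const;
type Status = (typeof VALID_STATUSES)[number];

type TaskInput = { title: string; status: Status; acceptanceCriteria: string[] };

// The agent never decides what is "sufficient": the contract does.
function validateTask(input: unknown): TaskInput {
  const t = input as Partial<TaskInput>;
  if (typeof t.title !== "string" || t.title.length === 0) {
    throw new Error("title is required");
  }
  if (!VALID_STATUSES.includes(t.status as Status)) {
    throw new Error(`invalid status: ${t.status}`);
  }
  if (!Array.isArray(t.acceptanceCriteria) || t.acceptanceCriteria.length === 0) {
    throw new Error("at least one acceptance criterion is required");
  }
  return t as TaskInput;
}
```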

A: Attach flow gates

Phase, prerequisites, and status are not optional “best practices.” They are runtime checks that code executes at every transition. A gate is a function that receives the current state and proposed action, then returns “allowed” or “blocked with reason.”

In mcp-graph, the unified-gate centralizes all these checks in a single point. This ensures no tool call skips a mandatory step, regardless of which agent is operating or which prompt was used.

P: Prove with pipeline

Use input (start_task) and output (finish_task) pipelines to standardize execution. The input pipeline assembles complete, filtered context. The output pipeline validates results against objective criteria.

This transforms each task execution into a reproducible process: same input, same validation, comparable results. The pipeline is the executable contract between “what the agent received” and “what the agent must deliver.”
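The two halves of that contract can be written down as types. These shapes are assumptions made for illustration, not the actual mcp-graph pipeline types:

```typescript
// Hypothetical shape of the pipeline's executable contract; names are
// illustrative, not the actual mcp-graph types.
interface StartTaskPackage {
  task: { id: string; title: string };
  acceptanceCriteria: string[]; // parsed, testable ACs
  context: string[];            // RAG-filtered snippets with citations
  expectedTests: string[];      // tests the result must satisfy
}

interface FinishTaskReport {
  taskId: string;
  checksPassed: string[];
  checksFailed: string[];
  evidence: string[];           // links/paths proving each check
}

// The output contract closes the loop: "done" requires an empty failure list.
function canComplete(report: FinishTaskReport): boolean {
  return report.checksFailed.length === 0;
}
```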

A: Audit quality continuously

Run harness and quality gates to detect regression before it explodes in production. The mcp-graph harness score evaluates 7 dimensions of structural quality and emits hallucination risk alerts when a dimension falls below threshold.
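The scoring logic can be sketched as a weighted-free average with a per-dimension alert. The dimension names and the 0.6 threshold below are assumptions for illustration, not the actual mcp-graph values:

```typescript
// Hypothetical harness-score sketch; dimension names and the alert
// threshold are illustrative, not the actual mcp-graph implementation.
type HarnessReport = { score: number; alerts: string[] };

function harnessScore(
  dimensions: Record<string, number>, // each dimension scored 0..1
  threshold = 0.6,
): HarnessReport {
  const entries = Object.entries(dimensions);
  // Overall score is the mean of all dimensions.
  const score = entries.reduce((sum, [, v]) => sum + v, 0) / entries.length;
  // Any single weak dimension raises a targeted alert, even if the mean is fine.
  const alerts = entries
    .filter(([, v]) => v < threshold)
    .map(([name]) => `hallucination risk: ${name} below threshold`);
  return { score, alerts };
}
```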

Continuous auditing closes the loop: contracts (M) define the rules, gates (A) enforce them, the pipeline (P) standardizes execution, and auditing (A) verifies that everything continues working over time.

How to prove this on your team: a two-week experiment

If you want to validate this thesis in practice, I propose a simple A/B experiment that can be run in two weeks:

Group A (control): agent with free flow, no strong gates, no mandatory DoD, no concurrency control. The agent receives the task and has full autonomy to execute however it wants.

Group B (engineering): same agent, but with strict schema, start_task/finish_task, DoD with required checks, RAG with citation, and lock/versioning.

Metrics to compare:

| Metric | What it measures | Expectation |
| --- | --- | --- |
| Lead time | Total time from request to delivery | B lower (less rework) |
| Cycle time | Active execution time | B slightly higher (gates add overhead) |
| Task reopen rate | Tasks that went from “done” back to “in progress” | B significantly lower |
| Post-merge defects | Bugs found after code entered the main branch | B lower |
| % of “done” blocked by DoD | How often DoD prevented a premature done | B exclusive (indicator of prevented hallucinations) |
| Harness regression | Drop in structural quality score over time | B stable, A degrading |

Success criteria: if B delivers faster (lower lead time), with less reopening and fewer escaped defects, the thesis is empirically supported on your team. The bottleneck was engineering, not the model.

In my experience, B tends to have slightly higher cycle time per task (because gates add overhead), but significantly lower total lead time (because there’s no rework). The net result is more real throughput at lower total cost.

Why this matters for time-to-market

The business argument is direct: rework is the silent killer of time-to-market. When an agent “delivers fast” but the delivery needs to be redone, fixed, or completed manually, the total time is greater than if the delivery had been slower but correct.

Constraint engineering trades apparent speed for real speed. The agent takes slightly longer per task (because it needs to pass through gates), but each delivered task is truly delivered. No surprises in review. No unexpected rollback. No entire sprint consumed fixing what an agent “delivered” in the previous sprint.

Companies like Stripe, Datadog, and Salesforce reached this same conclusion through different paths. Stripe with their Minions and the “execution envelope.” Datadog with harness-first engineering and deterministic testing. Salesforce with Agent Graph and finite state machines. The pattern converges: model intelligence only converts to business value when wrapped in deterministic engineering.

Conclusion

Models hallucinate. That’s a fact. But in production, what defines hallucination’s impact is the containment architecture around the model.

When code enforces contracts, valid state transitions, traceable evidence, and safe concurrency, the agent stops being a “lucky text generator” and becomes a reliable executor. Hallucination may still be generated internally by the model, but it never persists in the system, because each gate, each check, each validation works as a filter separating valid output from invented output.

The business result is clear: less rework, less rollback, more real throughput. In other words, lower time-to-market with greater safety.

The next time your agent “hallucinates” in production, before swapping the model or tweaking the prompt, look at the surrounding code. Ask: is there an input contract? Is there output validation? Is there concurrency control? Is there context traceability?

If the answer is “no” to any of those questions, the problem isn’t the model. It’s the engineering.

And the good news is that engineering, unlike models, is 100% under your control.
