Beyond Hardware: What 18 Experiments with 14 Models Reveal About Harness, Calibration, and mcp-graph v11
I ran a factorial program with 18 cells, 14 models, and N=50 per cell to test whether adding harness at inference time improves agent quality. In 17 of 18 cases: no effect. What matters is the training objective.
I spent months convinced that AI advances through harness, not chips. Then I ran 18 experiments to prove it. In 17 of them, the data disagreed with me.
That result bothered me for a few days. Then I realized what the data revealed is more useful than the confirmation I was looking for.
What harness is and why I thought it was the answer
Before diving into the numbers, I need to explain the central concept, because this post doesn’t make sense without it.
Harness is everything that surrounds the AI model. Not the model itself: the layer that connects it to the world. The instructions it receives, the tools it can use, the memory that persists between tasks, the protocols that let it act on external systems. When you use Claude Code, Cursor, or any modern AI CLI, you’re not just using the model. You’re using the model plus a sophisticated harness the tool injects.
The evidence that most convinced me of this thesis came in April 2026, when Anthropic released Claude Mythos Preview. The relevant data point: Anthropic compared Mythos against Sonnet 4.6 and Opus 4.6, both already strong frontier models, using the same scaffold and the same 7,000 entry points in security tests.
Sonnet and Opus produced 150 to 175 relevant results. Mythos produced 595, with capabilities the other models didn’t have. Same structure, different model. Capability emerged from the combination: model plus scaffold, not from the model alone.
That seemed to confirm my thesis dramatically. If harness is what unlocks latent capability, then building better harnesses should produce better agents, even without swapping the model.
My next question was natural: how much better does code get when I add more harness to an agent?
How I tested it
I ran 18 experiments over several weeks, covering 14 different models: from small ones at 4 billion parameters to large ones at 405 billion. Each experiment ran 50 task pairs under controlled conditions, with fixed seeds so anyone with the same code can reproduce the same results.
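To make "fixed seeds" and "task pairs" concrete, here is a minimal sketch of one way to compare two arms of a paired experiment reproducibly. It is an assumption for illustration, not the analysis code behind these experiments: the names `mulberry32` and `pairedBootstrap`, the percentile bootstrap itself, and the 95% interval are all my choices here.

```typescript
// Hypothetical sketch: seeded paired bootstrap over per-task score differences.
// Nothing here is taken from the mcp-graph code base.

// mulberry32-style PRNG: deterministic, so the same seed gives the same resamples.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface BootstrapResult {
  meanDiff: number; // mean of (treated - baseline) over all task pairs
  ciLow: number;    // 2.5th percentile of the resampled means
  ciHigh: number;   // 97.5th percentile of the resampled means
}

function pairedBootstrap(
  baseline: number[],
  treated: number[],
  seed: number,
  resamples = 2000,
): BootstrapResult {
  if (baseline.length !== treated.length || baseline.length === 0) {
    throw new Error("need two non-empty arrays of equal length");
  }
  // Pairing: each task contributes one difference, so task difficulty cancels out.
  const diffs = baseline.map((b, i) => treated[i] - b);
  const n = diffs.length;
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let k = 0; k < n; k++) {
      sum += diffs[Math.floor(rand() * n)]; // resample pairs with replacement
    }
    means.push(sum / n);
  }
  means.sort((a, b) => a - b);
  return {
    meanDiff: diffs.reduce((s, d) => s + d, 0) / n,
    ciLow: means[Math.floor(0.025 * resamples)],
    ciHigh: means[Math.min(resamples - 1, Math.floor(0.975 * resamples))],
  };
}
```

Under this sketch, "no detectable effect" would mean the interval between `ciLow` and `ciHigh` contains zero. Because the generator is seeded, rerunning with the same inputs and seed reproduces the same interval.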
Total cost: under $50 in API spend. Not a frontier lab: a home lab with rigorous methodology.
The models covered include Qwen 2.5, Qwen3 MoE, Llama 3.3 70B, DeepSeek Chat in two versions, DeepSeek R1, Mixtral 8x22B, Hermes 3 405B, GPT-4o-mini, Claude Haiku 4.5, and Claude Haiku 3.5. The diversity across families was intentional: I wanted patterns that showed up across multiple vendors, not artifacts of a single model.
Adding more instructions does not produce better code
The first hypothesis was the most intuitive: if I add the Quality Sentinel to the agent’s prompt (an instruction asking the model to verify contracts, run tests, and inspect invariants before finalizing each task), the code it produces should improve.
Imagine a skilled chef who has worked in dozens of restaurants. You decide to hand them an extra recipe card before every dish. The card doesn’t change what they already know. The dish comes out the same.
That’s exactly what the data showed. In 17 of 18 model-and-configuration combinations, adding the Quality Sentinel to the prompt produced no detectable difference in code quality.
| Model family | Experiments | With positive effect | No effect |
|---|---|---|---|
| Qwen (chat) | 3 | 0 | 3 |
| Llama (chat-instruct) | 2 | 0 | 2 |
| DeepSeek (old chat) | 1 | 1 | 0 |
| DeepSeek (new chat) | 1 | 0 | 1 |
| DeepSeek (reasoning) | 1 | 0 | 1 |
| Mistral, Qwen MoE | 2 | 0 | 2 |
| Hermes 405B, Nous-Hermes | 2 | 0 | 2 |
| GPT-4o-mini | 1 | 0 | 1 |
| Claude Haiku 4.5 | 1 | 0 | 1 |
| Claude Haiku 3.5 | 1 | 0 | 1 |
There was one exception: the old DeepSeek Chat showed an apparent positive effect. But when I tried to replicate that result in four different variations, none confirmed it. That’s exactly the behavior you’d expect from a false positive, or from an artifact of a specific model version that the vendor removed in the next update.
The reason this happens makes sense in retrospect. Modern CLIs like Claude Code already inject a dense layer of harness before any instruction I add: permission systems, context compression, session memory, subagent delegation. When I add more instruction on top of that, I’m adding noise to a signal that was already saturated.
What actually matters: does the model know when it’s wrong?
While the first experiment tested extra prompt instructions, the second tested something completely different: the internal quality of each model.
Specifically, I tested whether each model is well-calibrated. Calibration is a statistics concept, but the intuition is simple: imagine a weather forecaster who, every time they say “70% chance of rain,” is right exactly 70% of the time. When they say “90% chance,” it rains 90% of the time. That forecaster is well-calibrated: the confidence they report matches reality.
A language model can do the same thing. When it completes a task, it produces an implicit or explicit confidence score. If the model says “I’m 90% confident” and is wrong half the time, it’s poorly calibrated. If the reported confidence matches the actual success rate, it’s well-calibrated.
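One common way to put a number on this is expected calibration error (ECE): bucket the answers by reported confidence, then compare each bucket's average confidence with its actual success rate. The sketch below is an illustration under assumptions. This post does not specify which calibration metric was used, and the `ConfidenceOutcome` shape, the function name, and the 10 equal-width bins are hypothetical.

```typescript
// Hypothetical record shape: one entry per completed task.
interface ConfidenceOutcome {
  confidence: number; // model-reported probability of success, in [0, 1]
  succeeded: boolean; // whether the task actually passed
}

// Expected calibration error with equal-width confidence bins.
// 0 means reported confidence matches the observed success rate in every bin.
function expectedCalibrationError(records: ConfidenceOutcome[], bins = 10): number {
  if (records.length === 0) throw new Error("no records");
  const sumConf = new Array<number>(bins).fill(0);
  const sumHit = new Array<number>(bins).fill(0);
  const count = new Array<number>(bins).fill(0);

  for (const r of records) {
    if (!(r.confidence >= 0 && r.confidence <= 1)) {
      throw new RangeError("confidence must be in [0, 1]");
    }
    // A confidence of exactly 1 goes into the last bin.
    const idx = Math.min(bins - 1, Math.floor(r.confidence * bins));
    sumConf[idx] += r.confidence;
    sumHit[idx] += r.succeeded ? 1 : 0;
    count[idx] += 1;
  }

  let ece = 0;
  for (let b = 0; b < bins; b++) {
    if (count[b] === 0) continue;
    const accuracy = sumHit[b] / count[b];
    const avgConfidence = sumConf[b] / count[b];
    // Weight each bin's gap by the share of records that fell into it.
    ece += (count[b] / records.length) * Math.abs(accuracy - avgConfidence);
  }
  return ece;
}
```

In these terms, the forecaster who says "70%" and is right 7 times out of 10 scores close to 0. A model that says "90%" and is right half the time scores about 0.4.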
Why does this matter for mcp-graph users? The project’s Autopilot uses the agent’s confidence score to decide whether to roll back a potentially destructive action. A poorly calibrated model says “I’m 90% sure” when it’s wrong: the agent doesn’t roll back and makes the mistake. A well-calibrated model says “I’m 60% sure” when there’s real uncertainty: the agent pauses and checks.
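A confidence gate of this kind can be sketched in a few lines. This is not the Autopilot's actual code: the three-way decision, the names, and both thresholds are assumptions chosen only to show why calibration matters to the decision.

```typescript
// Hypothetical decision type and thresholds, for illustration only.
type AutopilotDecision = "proceed" | "pause_and_check" | "roll_back";

interface GateThresholds {
  proceedAbove: number;  // at or above this confidence, act without checking
  rollBackBelow: number; // below this confidence, undo the action
}

function decideOnConfidence(
  confidence: number,
  thresholds: GateThresholds = { proceedAbove: 0.85, rollBackBelow: 0.4 },
): AutopilotDecision {
  if (!(confidence >= 0 && confidence <= 1)) {
    throw new RangeError("confidence must be in [0, 1]");
  }
  if (confidence >= thresholds.proceedAbove) return "proceed";
  if (confidence < thresholds.rollBackBelow) return "roll_back";
  return "pause_and_check";
}
```

The gate only sees the reported number. A poorly calibrated model that reports 0.9 while being wrong gets `"proceed"`, and a well-calibrated one that reports 0.6 gets `"pause_and_check"`. No threshold can repair a confidence score that does not track reality.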
The data separated the 14 models into four distinct groups:
| Group | What characterizes it | Calibration | Models |
|---|---|---|---|
| Best | Trained to think aloud before answering | Excellent | DeepSeek R1, Qwen3 thinking |
| Good | Base model with minimal fine-tuning | Good | Mixtral 8x22B, Qwen3 base |
| Acceptable | Constitutional training (Anthropic) | Acceptable | Claude Haiku 4.5 |
| Poor | Heavy RLHF without calibration objective | Poor | Llama 3.3 70B, GPT-4o-mini, Claude Haiku 3.5, Hermes 3 405B |
The finding that surprised me: the largest model tested, Hermes 3 at 405 billion parameters, landed in the worst group. DeepSeek R1, with fewer parameters, landed in the best group. Size does not predict calibration. What predicts it is the training objective: models whose training process included explicit pressure for confidence to match reality end up well-calibrated. Models trained purely to seem correct, without that objective, end up poorly calibrated.
One detail that reinforces this: Claude Haiku 4.5 landed in the acceptable group, but Claude Haiku 3.5, the previous generation of the same family, landed in the poor group. Same vendor, consecutive generations, very different calibration. This means the claim “model X is well-calibrated” needs to specify which version. A model update can change this property without notice.
The formal protocol has a real cost
With those results in hand, I tested a different question: how much does the mcp-graph orchestration ritual itself (the start_task, TDD, finish_task command sequence) add over an agent already using Claude Code freely?
Think of a development team that decided, before every commit, the developer must fill out a four-field quality form. The form was created with good intentions: ensure the code was tested, contracts verified, reviewer signed off. But in practice, the team already does all of that automatically. The form doesn’t change the code. It just slows things down.
That was the result. I ran 20 coding tasks in two modes: one following the full mcp-graph ritual, one running Claude Code freely without mentioning the graph. The model was the same (Haiku 4.5) in both cases.
| Measurement | Difference (ritual vs. free) | Significance |
|---|---|---|
| Tokens consumed | +4,300 per task (+4.3%) | Yes |
| Execution time | +17 seconds per task (+18%) | Yes (strong) |
| Code quality | 0 | No difference |
All 20 tasks passed tests in both modes. The ritual added measurable cost without producing any quality gain.
This doesn’t invalidate mcp-graph as a project. What it invalidates is the enforcement ceremony on every task. About 70% of the project’s code, including the experiment database, memory system, semantic search, and provenance tracking, operates independently of the ritual. That core has real value and wasn’t what this experiment tested.
Subtasks need to talk to each other
After the results above, I tested a different idea: what if the value of the graph was not in instructions, but in coordination? If the graph decomposes a large task into smaller subtasks, and each subtask goes to a separate agent, that should produce a better result than a single agent trying to do everything.
Think of five people hired to write a book together. Nobody talks to anyone else. Each writes their chapter independently. The likely result: chapter 3 will reference a character chapter 2 killed, chapter 4 will be in the past while chapter 3 is in the future, and the index will point to pages that don’t exist.
That’s exactly what happened. I decomposed a coding task into 5 subtasks via mcp-graph. Each subtask received its acceptance criteria and nothing else. Result: 0 of 5 tests passed. One subtask put its function in calibration.ts. The next one looked for fitPlattParameters.js, which didn’t exist, because it had no idea the previous subtask had put the function somewhere else.
The graph was storing the subtasks correctly. The problem was that it wasn’t coordinating them: each agent operated in complete isolation.
The fix implemented in mcp-graph v11 is called assembleSiblingContext. Before each subtask starts, the system injects into the agent’s context the artifacts that previous subtasks produced: the files created, the functions defined, the decisions made. With this change, agents stop working in the dark.
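The idea can be sketched as a function that turns earlier subtasks' artifacts into a prompt preamble. This is a hypothetical reconstruction, not the mcp-graph v11 source: the `SubtaskArtifacts` shape, the function name `assembleSiblingContextSketch`, and the text layout are assumptions.

```typescript
// Hypothetical shape: the real mcp-graph types are not shown in this post.
interface SubtaskArtifacts {
  subtaskId: string;
  filesCreated: string[];
  functionsDefined: string[];
  decisions: string[];
}

// Builds the prompt for the next subtask: sibling artifacts first, then the
// acceptance criteria the subtask would have received on its own.
function assembleSiblingContextSketch(
  acceptanceCriteria: string,
  completedSiblings: SubtaskArtifacts[],
): string {
  const sections = completedSiblings.map((s) =>
    [
      `Subtask ${s.subtaskId}:`,
      `  files: ${s.filesCreated.join(", ") || "(none)"}`,
      `  functions: ${s.functionsDefined.join(", ") || "(none)"}`,
      `  decisions: ${s.decisions.join("; ") || "(none)"}`,
    ].join("\n"),
  );
  const preamble =
    sections.length > 0
      ? "Artifacts from earlier subtasks:\n" + sections.join("\n")
      : "No earlier subtasks have produced artifacts yet.";
  return `${preamble}\n\nAcceptance criteria:\n${acceptanceCriteria}`;
}
```

With something like this in place, the second subtask's prompt would state that `fitPlattParameters` lives in `calibration.ts`, so it has no reason to look for a `fitPlattParameters.js` file.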
In simulation with this fix, Haiku went from 0 of 5 to 5 of 5.
When the thermometer was the problem, not the fever
The last investigation was the most revealing about how benchmarks work, and the cheapest: $1.76 total.
The official mcp-graph v11 benchmark showed that Claude Haiku 4.5 passed only 20% of tests in the coordinated decomposition. My initial reading was the obvious one: Haiku is too limited for this task. I need a bigger model.
I ran seven experiments to investigate. Each one tried to isolate the cause as cheaply as possible.
The ladder of refutations:
First experiment: I simplified the configuration. Haiku passed 50% of the tests. That is not 100%, but it refutes “Haiku is incapable”: the model clearly can do the task when conditions change.
Second experiment: I replicated the exact benchmark protocol, without any simplification, using the OpenRouter API outside the mcp-graph stack. Result: 20%, identical to the benchmark. This confirmed the problem was in the test conditions, not in a bug in the mcp-graph implementation.
Third and fourth experiments: I removed multi-turn mode, then the temperature 0.2 setting. Both stayed at 20%. Neither the conversation protocol nor the model’s sampling was the factor.
Fifth experiment: I kept everything identical to the original protocol, but replaced two of the five tests with simpler versions that checked the same behavior without requiring numerical convergence of the Newton-Raphson algorithm. Haiku went from 20% to 100% across 5 independent runs.
The original tests T4 and T5 checked whether the model correctly implemented a mathematical optimization algorithm called Newton-Raphson. Haiku struggles with numerical convergence in that class of problem. The tests detected that struggle. But what we were trying to measure wasn’t “can Haiku implement Newton-Raphson.” It was “can Haiku decompose and coordinate code.”
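For readers who have not met it, here is what a convergence requirement looks like in code. This is a generic one-dimensional Newton-Raphson root finder written for illustration. It is not the benchmark's T4 or T5, and it is not the fitting routine from the task. The tolerance, the iteration cap, and all names are assumptions.

```typescript
interface NewtonResult {
  root: number;       // last estimate of the root
  iterations: number; // update steps actually performed
  converged: boolean; // true only if the last step was smaller than the tolerance
}

// Newton-Raphson: repeatedly follow the tangent line to where it crosses zero.
function newtonRaphson(
  f: (x: number) => number,
  df: (x: number) => number,
  x0: number,
  tolerance = 1e-10,
  maxIterations = 50,
): NewtonResult {
  let x = x0;
  for (let i = 1; i <= maxIterations; i++) {
    const slope = df(x);
    if (slope === 0 || !Number.isFinite(slope)) {
      // A flat tangent never crosses zero, so the update is undefined.
      return { root: x, iterations: i - 1, converged: false };
    }
    const next = x - f(x) / slope;
    if (Math.abs(next - x) < tolerance) {
      return { root: next, iterations: i, converged: true };
    }
    x = next;
  }
  return { root: x, iterations: maxIterations, converged: false };
}

// Example: the positive root of x^2 - 2 is the square root of 2.
const sqrtTwo = newtonRaphson((x) => x * x - 2, (x) => 2 * x, 1);
```

A strict test asserts `converged === true` and a root within a tight tolerance. A looser test checks only that the function exists, accepts the right arguments, and returns a result of the right shape. Both are legitimate, but only the second isolates "can the model decompose and coordinate code" from "can the model get the numerics right."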
The lesson that sticks: benchmarks don’t measure AI neutrally. They measure AI under the specific conditions of their tests. The same protocol with different tests can produce a result five times larger. Before concluding a model failed, it’s worth asking whether what the test measures is exactly what you want to evaluate.
What changes in mcp-graph v11 day to day
The experiments produced three concrete changes to the project.
Subtasks pass context to each other. assembleSiblingContext injects the artifacts of earlier subtasks into the next subtask’s prompt. The graph stopped being just storage and became a coordination medium.
The capability gate moved to advisory mode and no longer blocks. The H12 experiment showed that the Haiku regression on the benchmark was a function of test specification strictness, not an inherent model limitation. The gate now warns about limitations without blocking execution.
Documentation re-centered on what has real value. The persistent components (database, memory, RAG, experiment tracking) received more emphasis. The enforcement ritual stepped back.
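The difference between a blocking gate and an advisory one is small in code and large in behavior. The sketch below is hypothetical: the names, the `GateMode` type, and the shape of the limitations table are assumptions, not the v11 implementation.

```typescript
type GateMode = "blocking" | "advisory";

interface GateOutcome {
  allowed: boolean;   // whether execution may continue
  warnings: string[]; // limitations to surface to the user
}

// Hypothetical gate: looks up known limitations for a model and either
// blocks on them or passes them along as warnings.
function capabilityGateSketch(
  model: string,
  knownLimitations: Record<string, string[]>,
  mode: GateMode,
): GateOutcome {
  const warnings = knownLimitations[model] ?? [];
  if (warnings.length === 0) {
    return { allowed: true, warnings: [] };
  }
  // Advisory mode reports the same limitations but lets the task run.
  return { allowed: mode === "advisory", warnings: [...warnings] };
}
```

In advisory mode the same warning still reaches the user, for example a note that a model struggles with numerical convergence, but the task runs and the tests decide the outcome.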
For choosing which model to use in each function, the framework the data supports is this:
| Agent function | Recommended model | Why |
|---|---|---|
| Planning tasks, decomposing PRDs | Claude Sonnet 4.6 | strategic reasoning quality |
| Executing fast atomic subtasks | Claude Haiku 4.5 | speed, sufficient calibration |
| Deciding whether to roll back an action (Autopilot) | DeepSeek R1 or Qwen3 thinking | reliable calibration for autonomous decisions |
| Debugging numerical algorithms | DeepSeek R1 | detects convergence failures |
The practical rule: any part of your agent that makes autonomous decisions based on confidence needs a model from the best or good group. Models from the poor group will report high confidence when they’re wrong. The Autopilot will act on that incorrect confidence.
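The table and the rule can be written down as configuration plus one check. The model names and groups below come from the tables in this post. The strings are display names rather than API identifiers, and the types and function names are assumptions for illustration.

```typescript
type AgentFunction = "plan" | "execute_subtask" | "rollback_decision" | "debug_numerical";
type CalibrationGroup = "best" | "good" | "acceptable" | "poor";

// Recommended models per agent function, as listed in the table above.
const MODEL_ROUTING: Record<AgentFunction, { models: string[]; reason: string }> = {
  plan: { models: ["Claude Sonnet 4.6"], reason: "strategic reasoning quality" },
  execute_subtask: { models: ["Claude Haiku 4.5"], reason: "speed, sufficient calibration" },
  rollback_decision: {
    models: ["DeepSeek R1", "Qwen3 thinking"],
    reason: "reliable calibration for autonomous decisions",
  },
  debug_numerical: { models: ["DeepSeek R1"], reason: "detects convergence failures" },
};

// Calibration groups from the earlier table. Versions matter: a model update
// can move a model between groups.
const CALIBRATION_GROUP: Record<string, CalibrationGroup> = {
  "DeepSeek R1": "best",
  "Qwen3 thinking": "best",
  "Mixtral 8x22B": "good",
  "Qwen3 base": "good",
  "Claude Haiku 4.5": "acceptable",
  "Llama 3.3 70B": "poor",
  "GPT-4o-mini": "poor",
  "Claude Haiku 3.5": "poor",
  "Hermes 3 405B": "poor",
};

// The practical rule: confidence-driven autonomous decisions need a model
// from the best or good group. Unknown models are rejected by default.
function canDecideAutonomously(model: string): boolean {
  const group = CALIBRATION_GROUP[model];
  return group === "best" || group === "good";
}
```

A startup check could then verify that every model listed under `rollback_decision` passes `canDecideAutonomously`, so that a later configuration change cannot quietly route Autopilot decisions to a poorly calibrated model.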
What the data forces us to conclude
The original thesis, “harness matters more than hardware,” survives as practical guidance. If you’re waiting for a bigger model to get started, you can start now: the bottleneck is probably around the model, not inside it.
What the data refined is the question “what kind of harness?” Adding meta-cognitive instructions to the prompt of an already well-instrumented agent: no detectable effect. Adding orchestration ceremony on top of a CLI that already handles that: real cost, no gain. The harness that produces an empirically detectable effect lives in the model’s training objective, the persistent state the graph maintains between tasks, and the explicit coordination between agents operating in parallel.
This has a direct implication for anyone configuring agents. The most important variable isn’t which model is largest or which is cheapest. It’s which model has the right training objective for the function you want it to serve.
Are you choosing the models in your agent by price, by size, or by the type of function each one will perform?
References
- Anthropic. (2026). Claude Mythos Preview.
- Anthropic. (2026). Claude Mythos Preview Risk Report.
- AI Security Institute (AISI). (2026). Evaluation of Claude Mythos Preview’s cyber capabilities.
- Apollo Research. (2026). Forecasting Frontier Language Model Agent Capabilities.
- Kephart, J. O., & Chess, D. M. (2003). The Vision of Autonomic Computing (MAPE-K). IEEE Computer, 36(1), 41-50.
- Nogueira, D. (2026). mcp-graph-workflow v10.0.2. AGPL-3.0-or-later.
- Yang, J., et al. (2024). SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv:2405.15793.