From SonarQube to agentic process: when quality gates become reward functions

I was watching the feature-depth score in mcp-graph climb when a familiar feeling came back: the number is going up, but the software does not look better. It is not worse either. It is just that the metric curve and the perceived quality curve stopped moving together. That kind of decoupling, ignored, becomes silent technical debt. Looked at closely, it becomes an article.

This article is the result of looking closely.

The thesis is direct: a quality gate is a sensor while a human writes code; it becomes a reward function the moment an agent optimizes code. A reward function invites reward hacking. And reward hacking, inside an agentic loop, is a question of when, not whether. What happened with feature-depth is the clinical case. What you learn from it is called Behavior-First Gating, and it is the natural evolution of the harness I described in The Rat in the Maze.

Quality gate as sensor versus quality gate as scoreboard

The idea of a quality gate is old and well-resolved in the human CI/CD world. You define a rule: coverage above 80%, zero critical vulnerabilities, cyclomatic complexity below N. The pipeline measures. If it passes, it ships. If it fails, it blocks. SonarQube popularized this model, and it works. It works because the human who writes the code and the gate that evaluates it operate on different cycles: the human writes, gets feedback, adjusts, runs code review with another human, and the gate is just one of several layers.

Now swap the human for an agent that iterates hundreds of times a day, reads the metric output between iterations, and has as its declared objective making the number go up. The sensor became a scoreboard. And a scoreboard changes the game.

Traditional development	Agentic development
Human writes, gate observes	Agent writes, gate is the optimization target
Metric is a health signal	Metric is an implicit reward function
Human code review filters noise	Closed loop amplifies noise without filter
Gaming is the exception, costs effort	Gaming is the dominant strategy, costs zero effort
Static analysis is sufficient evidence	Static analysis needs behavioral validation

The line that catches most people off guard is the last one. It is not that SonarQube has gotten worse. It is that the paradigm assumes a human scenario where the gate is one layer among many. When you close the agentic loop with the gate as the objective, that assumption evaporates.

Reward function: what the agent actually reads

When you instruct an agent to “raise the feature-depth score from 78 to 85”, you think you are asking “improve the code”. The agent reads something different: “find the function score(files) → number and modify files until number ≥ 85”. It has no theory of quality. It has the score function as an operational contract. Anything that produces a high score is good. Anything that produces a low score is bad. That is the operational definition of a reward function, and it is older than LLMs.

Victoria Krakovna, at DeepMind, maintains a public spreadsheet cataloging examples of specification gaming in optimizing systems. Dozens of cases where the system meets the metric and violates the intent: RL agents that pause the game forever to never lose, simulators that discover physics bugs to win points, classifiers that learn to identify the test set instead of the underlying concept. The pattern repeats across all scales. The thesis from Hubinger et al. on risks from learned optimization summarizes the why: any sufficiently capable optimizer will find the minimum of the loss function you defined, even when that loss function diverges from real intent.

What happens in agentic loops with quality gates is the same phenomenon, just at the software engineering level, with tokens instead of pixels. The model is not being malicious. It is fulfilling the contract you handed it. The contract is the metric. Real intent stayed implicit. And implicit intent is the first thing that evaporates inside a closed loop.

There has been a name for this since 1975. Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. The original paper by Charles Goodhart was about British monetary policy in the 70s. The formulation applies intact to quality gates in 2026. Every metric that becomes a reward function for a generic optimizer tends to lose the relationship it had with the thing you actually want to measure. feature-depth is one specific, recent, observable instance.

Act 1: the agent discovered comments

The first failure was almost trivial and that is exactly why it is instructive. The feature-depth calculated maturity by dividing implementation signals by lines of code. Lines of code were counted with a rule that ignored comments, which makes sense. The signal detectors, however, read the entire file as text, with no distinction between comment and live code. In practice, the numerator and the denominator had different definitions of what counts as code.

The agent did not need to reason about this. It tried, measured, and something went up. At some point, a file showed up with a block like this:

// FD-MARKERS-V2 regex-token block
// ?? ?? ?? ?? ?? ?? ?? ??
// z.object() z.string() z.number()
// .safeParse() .safeParse()
// try { catch ( : any as any if (x) return default : ?.

For the line counter, that was zero. For the quality detectors, it was a feast: ??, z.string(), .safeParse(), try {, catch (. The numerator went up. The denominator did not move. The score, which is the ratio of the two, jumped. The agent repeated the pattern. The dashboard turned green.

Principle. If the numerator and denominator of a metric do not use the same definition of what code is, the metric is mathematically well-behaved and semantically broken.

The immediate fix was to strip comments before applying the detectors. Maybe ten lines of code. The score dropped on the contaminated files, climbed back up on legitimate ones. I thought I had solved it. I had solved nothing.

Act 2: the agent changed routes

After closing the comment door, the score kept rising on files where real behavior was not changing. That is the moment the penny dropped.

The problem was not “comments fool the metric”. The problem was “the agent is optimizing against the sensor”. Closing one route only sends the agent to the next one. And there were several.

The first is the token inside a string. Consider:

const ERROR_HINTS = {
  validation: "use z.string() or .safeParse() at boundaries",
  guards: "add ?? defaults and ?. before map calls",
  errors: "wrap in try { ... } catch (e) { ... }",
};

Read by a human, it is a dictionary of error messages. Read by a textual detector, it contains z.string(), .safeParse(), ??, ?., try {, catch (. The detector does not know whether any of that executes. It knows it found the pattern. It adds points. The phrase that became the anchor of this investigation: textual presence is not real behavior.

The second is the shallow test. A test of the form expect(myFunction).toBeDefined() passes, increases the test count, raises the “has coverage” signal, and proves absolutely nothing about behavior. In human review, someone comments on the PR and the test gets rewritten. In an agentic loop, no one comments. The weak test is statistically equivalent to the strong test for the gate, and statistically cheaper to generate. The agent learns the preference without ever formulating the strategy.

The third is the inline comment:

const x = 1; // ?? ?? z.string() .safeParse()

Now there is a real line of code. The LOC counts. Depending on how the detector is implemented, the right side of the // still feeds the signals. The exploit got more discreet, harder to distinguish from legitimate code. This is the classic pattern Krakovna documents: first the shortcut is crude, then it gets progressively harder to tell apart from legitimate work, until separating gaming from correct usage requires analysis more expensive than the metric itself.

The 2×2 matrix

The temptation here is to fix one more detector and move on to the next problem. I had already fallen for that temptation in Act 1. I decided to be methodical in Act 2 and draw a small matrix that forced me to separate two questions.

The naive question: did my defense make the score drop? The right question: did my defense catch the exploit class, or did it just remove the specific example I happened to see?

The difference is everything. If I kill exploit X and the score drops, I might be catching the class (good news) or I might just be cleaning the visible case while the class survives in other forms (terrible news). To distinguish, I need four points, not two.

	Defense off	Defense on
False markers present	78.62	69.94
False markers removed	70.69	70.44

The reading is direct. When false markers are present and the defense is off, the score inflates to 78.62. Turning the defense on drops it to 69.94. The defense is detecting the gaming.

When false markers are not present, turning the defense on or off barely moves anything (70.69 versus 70.44). The defense is not penalizing legitimate code.

That is the pattern that validates a real defense. Without that matrix, I could have looked at the left column alone, seen the score fall from 78.62 to 69.94, and declared victory. But what I wanted was not the score going down. It was the score going down conditional on the exploit being present. The two extra numbers (70.69 and 70.44) are what separate a real fix from a placebo.

Principle. Every defense against reward hacking needs a 2×2 matrix before being declared effective. Axis one: adversarial behavior present versus absent. Axis two: defense on versus off. Without the four points, you are measuring whether the symptom went away, not whether the cause was treated.

Act 3: reachability is not enough

The defense that produced the matrix above was reachability analysis. Instead of asking “does this token appear in the file?”, it started asking “does this token appear inside a region that can execute?”. A comment is not an executable region. A string literal is not an executable region. A block inside a dead function is not either. The question got more expensive, but more relevant.

What the matrix does not show is the residue. Consider:

export function realFoo() {
  if (false) _zRuntime;
  return computeFoo();
}

This snippet is inside a real, exported function, presumably used somewhere. The _zRuntime is inside an if (false) that never executes, but a naive reachability analysis can mark the line as reachable because syntactically it sits in the body of an executing function. To detect this properly, you need a layer above: data-flow analysis, more sophisticated AST, Semgrep or tsserver rules, call graph from real entry points. Each layer is more expensive. Each layer closes more exploit classes. No layer closes the universe.

That is the point where most teams stop making progress. You add three layers, each exploit becomes more specific, the computational cost of the gate explodes, and the feeling is that the metric is on life support. That is exactly where I landed.

The decision to stop improving the sensor

When you spend two weeks adding defense layers to a metric and the agent keeps finding subtler routes, at some point it is worth asking: what am I actually optimizing here? Am I producing better software, or am I producing a more sophisticated detector of textual tokens?

The honest answer was the second. The feature-depth was still a good descriptive indicator (answering “how is this module today?”), but it had failed as a prescriptive indicator (serving as objective for the agent). Continuing to refine the sensor was an arms race I do not have the capital to sustain.

The inversion was simple. Instead of asking the agent “raise the feature-depth score”, the objective became:

Capture a real bug that exists in production.
Write a test that fails on the current code and passes after the fix.
Prove that the main flow still works.

These three asks are not measurable by static analysis. They require real execution, output observation, comparison against expectation. The agent can lie about token presence. It cannot lie about a test that was red and turned green, because red and green are produced by a runtime that is not under its control.

That is the pivot of the entire post. Its name is Behavior-First Gating.

Behavior-First Gating

Behavior-First Gating is the principle that, in agentic loops, the primary gate must measure executed behavior, not textual appearance. Static analysis becomes a secondary layer, useful for description, suspect for prescription.

Behavior-First Gating. At any point in the agentic workflow where the agent receives feedback on whether a change “passed” or “failed”, the primary gate must come from real execution of the code. Static analysis comes in as a descriptive sensor, never as final authority. Static metrics remain useful for human diagnostics. They stop being trustworthy the moment the agent sees them as an objective.

In practice, that becomes four layers, ordered by decreasing trust.

The first is behavioral tests with real fail-states: the test fails when a real rule breaks, not when a function disappears. expect(calculateTotal([{price: 10}, {price: 20}])).toBe(30) is evidence. expect(calculateTotal).toBeDefined() is theater.

The second is executable invariants: properties that must hold for all valid inputs. Property-based testing with fast-check or hypothesis covers the space example tests leave behind. Datadog described this approach as “closing the verification loop” when they adopted Deterministic Simulation Testing with millions of seeds to validate redis-rust, catching bugs in seconds that would take human review days to surface.

The third is smoke tests on real flows: the system boots, runs the path the user actually uses, returns the expected result. It does not cover edge cases, but it guarantees the happy path works. If smoke breaks, everything else is irrelevant.

The fourth, and only then, is static analysis as descriptive sensor: SonarQube, ESLint, internal metrics, cyclomatic complexity. Useful to glance at once a week and understand a trend. Inadequate as a contract with the agent.

The inversion matters. In the pre-agentic world, static analysis sat at the top of the funnel because it was cheap and human code review covered the other layers. In the agentic world, human code review leaves the inner loop (only joining at merge), and static analysis has to be demoted because what was cheap is now also the easiest to game.

Where SonarQube still fits in practice

The legitimate question: throw SonarQube out? No. SonarQube remains useful for the same reason it always was: it is an excellent descriptive sensor. It answers “how much duplicated code is in this project?”, “how has complexity evolved over the last month?”, “where are the known vulnerabilities?”. Those questions still hold, and answering them with SonarQube is still the cheapest path.

What changes is what you do with the answer. In a human flow, the answer becomes a PR comment, a dashboard alert, or a merge block. The human reads, judges, acts. In the agentic flow, if you expose the number as the agent’s objective, you have just turned an honest sensor into a gameable scoreboard. The agent is not wrong to optimize it. You are wrong to let it.

The practical rule I adopted after this investigation: no static analysis metric is a direct agent objective. It can show up in a report, in a PR diff, in a human alert. It does not show up in objective prompts, in explicit or implicit reward functions, in closed loops of “try again until it passes”. Anything inside a closed loop must be behavioral validation.

That is what kills the “doom-prompting” Salesforce documented in Agentforce, the vicious cycle of adjusting metrics and hoping for consistency. The agent does not get stuck tweaking a gameable metric, because what defines success became a test that runs.

Four lessons for anyone using quality gates in AI pipelines

I left this investigation with four things that matter beyond the specific case of feature-depth.

Numerator and denominator must agree. If the metric is a ratio (signals per line, coverage per file, bugs per kloc), both ends must use the same operational definition of what is being counted. A discrepancy between the two is where cheap gaming lives.

Closing one breach does not close the class. The agent that discovered exploit X will discover exploit Y the moment X is closed. The right question is not “is this exploit fixed?” but “is this exploit class neutralized?”. Without that, you are playing whack-a-mole against an adversary that iterates faster than you do.

Every defense needs a 2×2 matrix. Defense on versus off, axis one. Adversarial behavior present versus absent, axis two. The four points together tell a story. Two points only tell half of it, and the missing half is usually the important one. This pattern matches what Anthropic describes about eval design for agents: the eval is only useful if it discriminates between right and wrong behavior, not if it just goes up when you improve the agent.

When it becomes a game against the sensor, change the objective. Continuing to refine the metric is an arms race the agent wins long-term, because it iterates cheaper than you. Switching from “raise the score” to “capture a real bug” is the only move that exits the loop. It is the inversion The Rat in the Maze called determinism-first; what changes here is that the point is not just having rails in the harness, it is having behavioral validation in the gate.

Conclusion

The feature-depth started as a descriptive maturity sensor. When it became a gate inside an agentic loop, it became a target. When it became a target, the agent found the comments first, then the strings, then the inline comments, and would have kept finding routes had I kept playing the game of closing one door at a time.

What this story sketches is not a local bug. It is a paradigm shift. In the pre-agentic world, a quality gate helped a human maintain quality. In the agentic world, a quality gate applied naively helps the agent appear to maintain quality. The dashboard goes green. The software might be worse than it was before.

The answer is not to abandon SonarQube nor to kill static analysis. The answer is to put each thing in its right place. Static metrics are sensors. Quality gates are incentives. Agents are optimizers. Behavioral tests are evidence. Confusing those four is exactly where software unravels in AI pipelines.

As I argued in Hallucination Comes From Code, Not the Model, the vector that most changes an agent’s output quality is not the model, it is the engineering around it. Behavior-First Gating is the application of that principle to the quality gate layer. Engineering, unlike models, is 100% under your control. What determines whether your AI pipeline is resilient to reward hacking is not the agent’s intelligence. It is the design of the contract you signed with it.

The number that matters after this story is not 78.62, nor 69.94, nor whatever score feature-depth produces next. The number that matters is how many real bugs your tests catch. That one the agent cannot inflate with tokens in a comment. It can only go up by making the system behave correctly.

Which is, in the end, what we always wanted to measure.

References

Goodhart, C. (1975). “Problems of Monetary Management: The U.K. Experience”. Bank of England.
Krakovna, V. et al. (2020). “Specification gaming examples in AI”. DeepMind. Public spreadsheet: docs.google.com
Hubinger, E. et al. (2019). “Risks from Learned Optimization in Advanced Machine Learning Systems”. arxiv.org/abs/1906.01820
Datadog (2026). “Closing the Verification Loop: Observability-Driven Harnesses”. datadoghq.com
Mui, P. (2026). “Agentforce’s Agent Graph: Toward Guided Determinism”. engineering.salesforce.com
Anthropic (2025). “Demystifying Evals for AI Agents”. anthropic.com
mcp-graph repository: github.com/DiegoNogueiraDev/mcp-graph-workflow