Your AI Agent Has a Code Quality Problem
Research shows AI coding agents break 15-30% more often on messy code. The same complexity that slows human developers trips up LLMs, and the fixes are the ones we've been applying for years.
I first started using AI coding agents as part of my role at Meta, working in one of the world’s largest code repositories. My teams were building Horizon Worlds: multiple clients (Quest headsets, phones, web) and servers, significant complexity, vast amounts of code written before AI tooling existed.
Working with that legacy code was challenging. We’d always framed complexity and code smells as problems for humans. But a whitepaper from CodeScene and Lund University shows they hit LLMs just as hard.
The study published in January 2026 tested six LLMs on 5,000 Python files and found that the quality of the code the AI operates on matters as much as the capability of the AI itself. Clean, well-structured code had 15-30% lower break rates during AI-driven refactoring. Messy code broke more often regardless of which model you threw at it.
Recently, I’ve been using tools like Claude Code much more outside of work, which gives me a better comparison with smaller, greenfield projects. It’s still easy to generate a large mess (and I haven’t always paid enough attention to the generated code to keep this under control), but I’ve found that these newer, smaller projects can typically stay factored in a way that lets LLMs perform at their best.
Code quality, managing complexity and generally holding a high bar for engineering excellence matters as much now as it ever did. If your team is rolling out AI agents across a codebase with unaddressed complexity, you’re compounding risk with every diff they generate.
What the research found
Markus Borg, Nadim Hagatulah, Adam Tornhill, and Emma Söderberg took 5,000 competitive programming solutions in Python, split them into “Healthy” and “Unhealthy” based on CodeScene’s CodeHealth metric (a 1-10 score capturing structural code smells), then asked six LLMs to refactor each file. The test: did the original test suite still pass afterwards?
The results were consistent across every medium-sized model they tested. Healthy code (scoring 9 or above) broke significantly less often:
- The best-performing medium-sized LLM (Qwen) passed tests on 80.7% of Healthy files vs 72.2% of Unhealthy ones
- The weakest (Granite) managed 46.5% vs 37.2%
- Risk reductions ranged from 15% to 30% depending on the model
Anthropic’s Sonnet 4.5 showed the same directional trend but pushed into higher overall success rates (86.8% vs 84.0%). And Claude Code, operating as a full coding agent, was the most conservative of all - around 95% success regardless of code health. But even Claude broke things. And the paper notes something uncomfortable: Claude sometimes stated “zero functionality changes” on files where it had, in fact, broken the tests.
What “healthy” means here
CodeScene’s CodeHealth metric was designed to identify anti-patterns that impact human ability to work effectively with code (in other words, code smells). The tool looks for specific issues and combines the findings into a file-level score. The five most common in the dataset were:
- Functions with multiple blocks of nested if/for/while structures, suggesting missing abstractions. This was the most prevalent smell, appearing in 4,901 of the 5,000 files.
- Too many independent execution paths through a function (high cyclomatic complexity). Found in 3,572 files.
- Loops inside conditionals inside loops - three, four, five levels deep. Present in 2,433 files.
- Boolean expressions that chain multiple logical operators together. Found in 1,328 files.
- Functions taking so many parameters that their purpose becomes ambiguous. Found in 724 files.
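To make the list above concrete, here is a deliberately contrived function (not from the study’s dataset) that packs several of these smells into a few lines: loops wrapping nested conditionals, a chained boolean expression, and a parameter list long enough to blur its purpose.

```python
# Illustrative only: one function exhibiting deep nesting, a chained
# boolean expression, and too many parameters.
def categorise(values, low, high, fallback, strict, verbose):
    results = []
    for v in values:                  # loop...
        if v is not None:             # ...wrapping nested conditionals
            if low <= v <= high:
                # chained boolean expression with multiple operators
                if strict and v != low and v != high and not verbose:
                    results.append("inner")
                else:
                    results.append("bounded")
            else:
                results.append("outside")
        else:
            results.append(fallback)
    return results
```

Every reviewer has seen a version of this; the study’s finding is that LLMs stumble over the same shape.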
Engineers have been flagging exactly these issues in code reviews for years. What’s new is the data showing that messy code doesn’t just slow humans down - it makes AI agents produce broken output too[^1].
Compound Benefits
The researchers trained decision trees to predict whether a refactoring would break, using three features: CodeHealth, LLM perplexity (the model’s internal confidence), and lines of code. CodeHealth carried 3-10 times more predictive information than either alternative. And the learned thresholds clustered around CodeHealth = 9, which is the same boundary CodeScene had already calibrated for human comprehension.
The threshold where code becomes risky for AI is roughly the same threshold where it becomes hard for humans. The mechanisms are different - LLMs don’t get frustrated or lose concentration - but the outcome is the same. Structural complexity is the shared bottleneck.
This has a compounding effect. Organisations with healthy codebases get faster human development (already established in prior research) AND more reliable AI assistance. Organisations sitting on legacy complexity don’t just miss out on the AI benefits - they actively accumulate AI-generated bugs in the parts of the codebase that are hardest to debug.
As engineers write less code directly, the job shifts towards product thinking and architectural decisions. Code quality is part of that architectural responsibility - not just system boundaries and component diagrams, but the shape of the code itself. Reading code and recognising when it’s dropped below a quality threshold remain skills we need to keep sharp. Models will probably get better at this over time, but right now the accountability sits with us.
What you can do about it
The main fixes are the same ones we’ve been applying for years to improve code for humans. Some have AI-specific nuance, especially if you’re considering how to remove humans from the loop in as many places as possible.
Measure before you deploy
Run complexity analysis on the code where you’re planning to use AI agents. The rough thresholds from the paper: cognitive complexity above 15 per function, nesting deeper than 4 levels, cyclomatic complexity above 15, or more than 5 function arguments should all raise flags. Most languages have static analysis tools that can surface these metrics - the specific tool matters less than making the measurement part of your workflow.
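As a sketch of what “making the measurement part of your workflow” can look like, here is a minimal checker built on Python’s standard `ast` module that flags two of the paper’s rough thresholds (argument count and nesting depth). Function and threshold names are my own; a real setup would lean on an established tool.

```python
import ast

MAX_ARGS = 5      # rough threshold from the paper
MAX_NESTING = 4   # rough threshold from the paper

def max_nesting(node, depth=0):
    """Deepest level of nested control-flow statements under `node`."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        bump = isinstance(child, (ast.If, ast.For, ast.While, ast.With, ast.Try))
        deepest = max(deepest, max_nesting(child, depth + int(bump)))
    return deepest

def flag_functions(source):
    """Return (function name, reason) pairs for threshold breaches."""
    flags = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            n_args = len(node.args.args) + len(node.args.kwonlyargs)
            if n_args > MAX_ARGS:
                flags.append((node.name, f"{n_args} arguments"))
            if max_nesting(node) > MAX_NESTING:
                flags.append((node.name, "nesting deeper than 4"))
    return flags
```

Wired into CI, something like this turns the paper’s thresholds from advice into a signal you see on every change.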
Stratify your codebase
Think about breaking up your codebase into “zones” where different levels of review and human scrutiny can be apportioned.
Low complexity files are safe ground - normal code review is enough (and in some cases, could be entirely automated with AI). Moderate complexity means a higher review burden; humans need to pay closer attention to the diffs. Files with deep nesting, high cyclomatic complexity, and long functions are human-first territory. If you want AI agents to work here safely, improve the code first. (The irony is that this is the sort of work that LLMs are not so great at.)
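The zoning policy above can be captured as a tiny mapping from a file-level health score to a review tier. The function and tier names here are illustrative; the 9-or-above boundary loosely follows the CodeHealth threshold reported in the paper.

```python
def review_zone(code_health: float) -> str:
    """Map a file-level health score (1-10) to a review policy.

    Thresholds are illustrative, loosely based on the paper's
    CodeHealth = 9 boundary for "Healthy" files.
    """
    if code_health >= 9:
        return "automated-review-ok"      # safe ground for AI agents
    if code_health >= 7:
        return "human-review-required"    # higher review burden on diffs
    return "human-first-refactor-before-ai"
```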
For all these changes there should be good automated tests running efficiently in the inner and outer loop. What do you mean, you don’t have tests?
Set up quality gates for AI output
AI agents can introduce new complexity, not just struggle with existing complexity. The paper found that all LLMs sometimes decreased CodeHealth even when the tests passed - meaning they produced working code that was structurally worse than what they started with.
Run the same complexity checks on AI-generated pull requests that you’d run on human-authored code. If an AI agent refactors a function and doubles its cyclomatic complexity in the process, that should fail review even if the tests pass. Code is code - treat all sources of change with equal scrutiny.
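One way to sketch such a gate: compare a rough cyclomatic complexity count before and after a change, and fail the check if complexity grows too much even when tests pass. This is a simplification of McCabe’s metric (it just counts branch points), and the names and growth threshold are assumptions of mine, not the paper’s.

```python
import ast

def cyclomatic(source):
    """Rough cyclomatic complexity: 1 + number of branch points.

    A simplification of McCabe's metric - good enough for a diff gate.
    """
    branch_nodes = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                    ast.BoolOp, ast.IfExp)
    return 1 + sum(isinstance(n, branch_nodes) for n in ast.walk(ast.parse(source)))

def gate(before_src, after_src, max_growth=1.0):
    """Pass only if the change keeps complexity growth under `max_growth` (100% here)."""
    before, after = cyclomatic(before_src), cyclomatic(after_src)
    return after <= before * (1 + max_growth)
```

A gate like this catches the failure mode the paper describes: a refactoring that keeps the tests green while making the structure worse.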
Target the worst offenders
With current AI capabilities, don’t expect quick gains in the hairiest legacy code.
The relationship between code health and AI reliability is linear - every improvement helps. But the biggest gains come from moving files out of the danger zone. Functions with cyclomatic complexity above 20, nesting five or six levels deep, or seven-plus arguments are the ones most likely to cause AI agents to produce broken output.
The fix is the same refactoring you’d do for human maintainability: extract functions, flatten nesting with early returns and guard clauses, simplify boolean expressions, introduce parameter objects. The difference is that now there’s a quantified reason to prioritise this work - it directly affects how reliably your AI tools perform.
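A small before/after sketch of those techniques (a made-up shipping example, not from the paper): guard clauses with early returns flatten the nesting, and a parameter object groups the flag-style arguments.

```python
from dataclasses import dataclass

# Before: nested conditionals and flag-style parameters (illustrative).
def ship_legacy(order, user, address, express, insured):
    if order is not None:
        if user is not None:
            if address is not None:
                cost = 10
                if express:
                    cost += 5
                if insured:
                    cost += 2
                return cost
    return None

# After: guard clauses replace the nesting; a parameter object groups options.
@dataclass
class ShippingOptions:
    express: bool = False
    insured: bool = False

def ship(order, user, address, options: ShippingOptions):
    if order is None or user is None or address is None:
        return None  # early return replaces three nested ifs
    cost = 10
    cost += 5 if options.express else 0
    cost += 2 if options.insured else 0
    return cost
```

Behaviour is unchanged; the structure is what improves - which, per the study, is exactly what lowers the odds of an AI agent breaking the function later.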
The competitive angle
Thoughtworks flagged “AI-friendly code design” in their April 2025 Technology Radar. Gartner projects AI coding assistant adoption will reach 90% by 2028. If those numbers are even roughly right, the gap between teams with healthy codebases and teams without is about to widen significantly.
So far, the best AI-friendly patterns align with established best practices. As AI evolves, expect more AI-specific patterns to emerge, so thinking about code design with this in mind will be extremely helpful.
Teams that have been investing in code quality - paying down technical debt, enforcing complexity limits, refactoring legacy modules - are finding the work pays off twice. Their developers were already faster. Now their AI agents are more reliable too.
Teams that have been deferring that investment are finding the opposite. AI adoption on unhealthy code doesn’t rescue you from technical debt. It gives you a new, faster way to accumulate it.
The paper puts it bluntly: “code quality is a prerequisite for safe and effective use of AI.” I’d go further. For organisations betting on AI-assisted development as a competitive advantage, code health is the foundation the entire bet rests on.
(If you’ve seen this play out in your own codebase, I’d be curious to hear about it.)
Footnotes
[^1]: Claude Code’s refactorings were often minor or cosmetic, with no predictable pattern to whether it went bold or conservative, and on multiple occasions it claimed changes worked when the tests were actually broken.