AI Code Generators Build Working Software That Rots From Within
The more capable an AI coding model becomes, the more bloated and structurally tangled the code it produces — and neither better prompts nor functional correctness fixes the problem.
- What they did — Researchers audited code smells across two scales: LLM-generated solutions to algorithmic problems (CodeContest dataset) and full multi-file software systems built by the MetaGPT multi-agent framework, comparing structural quality against human baselines.
- Key result — They found a 'Volume-Quality Inverse Law' where total lines of code nearly perfectly predict architectural decay, and that functional correctness is decoupled from quality — working code is just as likely to be structurally flawed as broken code.
- Why it matters — Teams using AI agents to generate multi-file systems should treat architectural review as mandatory, since neither prompt engineering nor functional testing catches the structural debt these tools introduce.
The Problem
When we evaluate AI code generators, we almost always ask one question: does the code work? Benchmarks like HumanEval and MBPP measure functional correctness — whether the output passes test cases [§2.2]. But working code isn't the same as maintainable code. Human developers have long accumulated technical debt under time pressure, producing code that functions but becomes increasingly painful to modify [§1]. The question is whether AI agents do the same, or worse.
Prior work suggested AI might actually be *better* than humans here. Santa et al. found that LLMs produced fewer issues detected by SonarQube, a popular code analysis tool, resulting in lower estimated remediation time [§2.2]. But the researchers behind this new study argue that finding is misleading: SonarQube lumps together superficial formatting violations (missing whitespace, absent docstrings) with genuine structural design flaws [§2.2]. LLMs are excellent at formatting. That doesn't mean they're good architects.
This study isolates the deeper problem — structural and architectural code smells — and tracks how they behave as AI-generated software scales from single-file scripts to complex multi-module systems [§1].
What They Did
The researchers ran two experiments at different scales of complexity [§1].
In the first experiment, they had multiple LLMs solve algorithmic coding problems sampled from the CodeContest dataset — the kind of self-contained, single-file challenges common in competitive programming [§1]. This established a baseline for how different models handle code quality on simple tasks.
In the second experiment, they created product requirements and fed them to MetaGPT, a multi-agent framework that mimics a software company by assigning distinct roles (Architect, Engineer, QA Tester) to separate AI agents that collaborate to build full software systems with multiple files and modules [§2.3]. This let them study what happens when AI moves beyond isolated scripts into real architectural territory.
For detection, they used PyExamine, a static analysis tool that categorizes code smells at three levels [§2.4]. Code-level smells are localized problems within a single function — think of a method that's 300 lines long when it should be 30. Structural smells reflect flawed relationships between classes, like excessive coupling where changing one component forces changes in many others. Architectural smells are the most expensive: system-wide dependency tangles that require invasive refactoring to fix [§2.4].
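To make the code-level category concrete, here is a minimal sketch of a Long Method check built on Python's standard `ast` module. It is not PyExamine's implementation or API. The function name `find_long_functions` and the 30-line threshold are assumptions chosen for illustration, echoing the "300 lines when it should be 30" example above.

```python
import ast

# Hypothetical threshold for illustration; real tools use their own configurable limits.
MAX_FUNCTION_LINES = 30


def find_long_functions(source: str, max_lines: int = MAX_FUNCTION_LINES) -> list[tuple[str, int]]:
    """Return (function_name, line_count) for each function longer than max_lines."""
    tree = ast.parse(source)
    offenders = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Count physical lines from the "def" line to the last line of the body.
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                offenders.append((node.name, length))
    return offenders
```

A check like this only covers the localized, code-level tier. Structural and architectural smells depend on relationships between classes and modules, so they cannot be found by inspecting one function at a time.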
By separating these categories, the study avoids the trap of prior work that mixed trivial linting issues with serious design flaws [§2.2].
The Results
The findings reveal what the authors call a "Reasoning-Complexity Trade-off" [Abstract]. On simple algorithmic tasks, more capable models like Qwen-480b inadvertently increase method bloat — cramming complex logic into single, monolithic procedural blocks (Long Method smells) — while humans tend to struggle more with state encapsulation (Temporal Fields) [§1]. The models solve harder problems but produce worse structure doing it.
As tasks scale to multi-file architectures, the smell profile shifts. Method-level bloat gives way to "God Class" patterns (Too Many Branches) and redundant implementations (Potential Improper API Usage) [§1]. The agents achieve what the authors term a "Modular Mirage": they split code across files, creating the appearance of modularity, but fail to establish meaningful semantic cohesion between modules [§1]. The files are separate; the logic is still tangled.
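One crude way to look past file boundaries is to check how the files depend on each other. The sketch below builds an import graph for a set of modules and reports pairs that import each other. It is an illustrative proxy only, not the study's measure of cohesion; the function names and the dictionary-of-sources input format are assumptions.

```python
import ast


def project_imports(modules: dict[str, str]) -> dict[str, set[str]]:
    """Map each module name to the other project modules it imports.

    `modules` maps a module name to its source text (a hypothetical input
    format chosen for this sketch). Third-party and stdlib imports are ignored.
    """
    known = set(modules)
    graph = {}
    for name, source in modules.items():
        deps = set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                deps.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                deps.add(node.module.split(".")[0])
        # Keep only dependencies on other modules in the same project.
        graph[name] = (deps & known) - {name}
    return graph


def mutual_dependencies(graph: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return sorted pairs of modules that import each other."""
    pairs = set()
    for module, deps in graph.items():
        for dep in deps:
            if module in graph.get(dep, set()):
                pairs.add(tuple(sorted((module, dep))))
    return sorted(pairs)
```

Two files that import each other are separate on disk but cannot be changed or tested independently, which is one simple symptom of the tangling described above.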
The most striking finding is the "Volume-Quality Inverse Law": total lines of code correlates strongly with architectural decay [Abstract]. In other words, the larger the generated codebase, the more structural problems it tends to contain. And critically, functional correctness is "decoupled from quality": code that passes all tests is just as likely to contain structural flaws as code that fails [Abstract]. Passing tests therefore provides no signal about maintainability.
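A team could check whether the same volume-to-decay relationship appears in its own generated projects by correlating size against smell counts. The sketch below computes a Pearson coefficient by hand. The summary does not say which statistic the authors used, so Pearson is an assumption, as are the function name and the idea of using per-project smell counts as the second variable.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length sequences.

    Intended use in this sketch (an assumption, not the paper's method):
    xs = lines of code per project, ys = smell count per project.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 would mean smell counts rise almost in lockstep with code volume, while a value near 0 would mean size tells you little. The same function could be applied to a pass/fail flag against smell counts to probe the claimed decoupling from correctness.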
Perhaps most concerning for practitioners: increasing requirement specificity in prompts — adding more detail, using few-shot examples — does not mitigate this degradation [Abstract]. The statistical analysis confirms that advanced prompting techniques fail to prevent architectural rot [§1].
The study's scope has clear boundaries. The algorithmic tasks come from a single dataset (CodeContest), and the multi-agent experiments use only MetaGPT [§1]. Whether these patterns hold across other agent frameworks like AutoGen or CrewAI, or on codebases with mixed human-AI authorship, remains untested. Until cross-framework validation is done, teams can't rely on these specific smell profiles as universal diagnostics.
Why It Matters
This work reframes the central challenge of AI-based software engineering. The problem isn't getting AI to write code that works — it's preventing the code from becoming unmaintainable as it scales [Abstract]. The authors argue that future progress depends on "equipping agents with explicit architectural foresight" rather than better prompts [Abstract].
For teams today, the practical implication is concrete: if you're using AI agents to generate anything beyond isolated scripts, functional tests are insufficient quality gates. Code review focused on structural and architectural smells — coupling, cohesion, dependency patterns — becomes essential. This is a pattern emerging from controlled research, not yet reflected in production tooling. But the finding that prompt engineering doesn't help means the fix won't come from the prompt side. It will need to come from the tools themselves, or from human reviewers who know what to look for.