Signal #8 — 2026-W19

AI Code Generators Build Working Software That Rots From Within

Yuecai Zhu, Nikolaos Tsantalis, Peter C. Rigby

The more capable an AI coding model becomes, the more bloated and structurally tangled the code it produces — and neither better prompts nor functional correctness fixes the problem.

What they did — Researchers audited code smells across two scales: LLM-generated solutions to algorithmic problems (CodeContest dataset) and full multi-file software systems built by the MetaGPT multi-agent framework, comparing structural quality against human baselines.
Key result — They found a 'Volume-Quality Inverse Law' in which total lines of code correlates strongly with architectural decay, and that functional correctness is decoupled from quality: working code is just as likely to be structurally flawed as broken code.
Why it matters — Teams using AI agents to generate multi-file systems should treat architectural review as mandatory, since neither prompt engineering nor functional testing catches the structural debt these tools introduce.

The Problem

When we evaluate AI code generators, we almost always ask one question: does the code work? Benchmarks like HumanEval and MBPP measure functional correctness — whether the output passes test cases [§2.2]. But working code isn't the same as maintainable code. Human developers have long accumulated technical debt under time pressure, producing code that functions but becomes increasingly painful to modify [§1]. The question is whether AI agents do the same, or worse.

Prior work suggested AI might actually be *better* than humans here. Santa et al. found that LLMs produced fewer issues detected by SonarQube, a popular code analysis tool, resulting in lower estimated remediation time [§2.2]. But the researchers behind this new study argue that finding is misleading: SonarQube lumps together superficial formatting violations (missing whitespace, absent docstrings) with genuine structural design flaws [§2.2]. LLMs are excellent at formatting. That doesn't mean they're good architects.

This study isolates the deeper problem — structural and architectural code smells — and tracks how they behave as AI-generated software scales from single-file scripts to complex multi-module systems [§1].

What They Did

The researchers ran two experiments at different scales of complexity [§1].

In the first experiment, they had multiple LLMs solve algorithmic coding problems sampled from the CodeContest dataset — the kind of self-contained, single-file challenges common in competitive programming [§1]. This established a baseline for how different models handle code quality on simple tasks.

In the second experiment, they created product requirements and fed them to MetaGPT, a multi-agent framework that mimics a software company by assigning distinct roles (Architect, Engineer, QA Tester) to separate AI agents that collaborate to build full software systems with multiple files and modules [§2.3]. This let them study what happens when AI moves beyond isolated scripts into real architectural territory.

For detection, they used PyExamine, a static analysis tool that categorizes code smells at three levels [§2.4]. Code-level smells are localized problems within a single function — think of a method that's 300 lines long when it should be 30. Structural smells reflect flawed relationships between classes, like excessive coupling where changing one component forces changes in many others. Architectural smells are the most expensive: system-wide dependency tangles that require invasive refactoring to fix [§2.4].

By separating these categories, the study avoids the trap of prior work that mixed trivial linting issues with serious design flaws [§2.2].
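
To make the code-level category concrete, here is a minimal sketch of a Long Method check built on Python's standard `ast` module. It is not PyExamine's implementation: the 30-line threshold and the names `find_long_functions` and `MAX_FUNCTION_LINES` are assumptions chosen for illustration, echoing the 300-versus-30 example above.

```python
import ast

# Hypothetical threshold for illustration; PyExamine's real cutoffs are not stated here.
MAX_FUNCTION_LINES = 30


def find_long_functions(source: str, max_lines: int = MAX_FUNCTION_LINES) -> list[tuple[str, int]]:
    """Return (function_name, line_count) for every function longer than max_lines."""
    tree = ast.parse(source)
    long_functions = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno/end_lineno are 1-based and inclusive, so add 1 for the span length.
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                long_functions.append((node.name, length))
    return long_functions


# Toy input: one 2-line function and one function with 40 body lines.
sample = (
    "def short():\n"
    "    return 1\n"
    "\n"
    "def long_one():\n"
    + "".join(f"    x{i} = {i}\n" for i in range(40))
)

flagged = find_long_functions(sample)
print(flagged)
```

A per-function line count like this only reaches the code-level category. Structural and architectural smells depend on relationships between classes and modules, which need cross-file analysis.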

The Results

The findings reveal what the authors call a "Reasoning-Complexity Trade-off" [Abstract]. On simple algorithmic tasks, more capable models like Qwen-480b inadvertently increase method bloat — cramming complex logic into single, monolithic procedural blocks (Long Method smells) — while humans tend to struggle more with state encapsulation (Temporal Fields) [§1]. The models solve harder problems but produce worse structure doing it.

As tasks scale to multi-file architectures, the smell profile shifts. Method-level bloat gives way to "God Class" patterns (Too Many Branches) and redundant implementations (Potential Improper API Usage) [§1]. The agents achieve what the authors term a "Modular Mirage": they split code across files, creating the appearance of modularity, but fail to establish meaningful semantic cohesion between modules [§1]. The files are separate; the logic is still tangled.

The most striking finding is the "Volume-Quality Inverse Law": total lines of code correlates strongly with architectural decay [Abstract]. The larger the generated codebase, the worse its structure tends to be. Critically, functional correctness is also "decoupled from quality": code that passes all tests is just as likely to contain structural flaws as code that fails [Abstract]. Passing tests therefore gives no signal about maintainability.
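
For readers who want to check a volume-versus-smells relationship on their own repositories, the sketch below computes a correlation coefficient between lines of code and smell counts. Pearson's r is an assumption here, since this summary does not say which coefficient the authors used, and the numbers in `loc_per_project` and `smells_per_project` are invented toy values, not the paper's data.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / math.sqrt(var_x * var_y)


# Toy values invented for illustration only (NOT from the study).
loc_per_project = [120, 340, 560, 910, 1500]
smells_per_project = [2, 5, 9, 14, 25]

r = pearson_r(loc_per_project, smells_per_project)
print(f"Pearson r between LOC and smell count: {r:.3f}")
```

A value near 1 would mean smell counts rise almost in lockstep with code volume. A rank-based coefficient such as Spearman's would be a reasonable alternative if the relationship is monotonic but not linear.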

Perhaps most concerning for practitioners: increasing requirement specificity in prompts — adding more detail, using few-shot examples — does not mitigate this degradation [Abstract]. The statistical analysis confirms that advanced prompting techniques fail to prevent architectural rot [§1].

The study's scope has clear boundaries. The algorithmic tasks come from a single dataset (CodeContest), and the multi-agent experiments use only MetaGPT [§1]. Whether these patterns hold across other agent frameworks like AutoGen or CrewAI, or on codebases with mixed human-AI authorship, remains untested — meaning teams can't yet rely on these specific smell profiles as universal diagnostics, and further cross-framework validation will be needed.

Why It Matters

This work reframes the central challenge of AI-based software engineering. The problem isn't getting AI to write code that works — it's preventing the code from becoming unmaintainable as it scales [Abstract]. The authors argue that future progress depends on "equipping agents with explicit architectural foresight" rather than better prompts [Abstract].

For teams today, the practical implication is concrete: if you're using AI agents to generate anything beyond isolated scripts, functional tests are insufficient quality gates. Code review focused on structural and architectural smells — coupling, cohesion, dependency patterns — becomes essential. This is a pattern emerging from controlled research, not yet reflected in production tooling. But the finding that prompt engineering doesn't help means the fix won't come from the prompt side. It will need to come from the tools themselves, or from human reviewers who know what to look for.

Vision-Language Models Can't Recognize Domain-Specific Actions

Tanush Yadav, Mohammadreza Salehi, Jae Sung Park, Vivek Ramanujan, Hannaneh Hajishirzi, Yejin Choi, Ali Farhadi, Rohun Tripathi, Ranjay Krishna

The best open-weight 8B vision-language model scores just 45% on four-option multiple choice when asked to identify specific actions like a "devil's sonic" in pen spinning or a "triple axel" in figure skating, and only 59.2% on yes/no questions, where random guessing already scores 50%.

What they did — The researchers built VideoNet, a benchmark of 1,000 actions across 37 domains (from pen spinning to suturing), and a training dataset of ~500k video QA pairs, then evaluated 14+ VLMs on domain-specific action recognition.
Key result — The best open-weight 8B model (Qwen3-VL-8B) scored 45.0% on multiple-choice and only 59.2% on binary yes/no questions where random chance is 50%; fine-tuning a smaller 4B model on the new training data improved performance by 11.5 percentage points, surpassing all open 8B models.
Why it matters — Any product relying on VLMs to recognize specialized actions — coaching tools, medical training systems, content tagging — will need domain-specific training data, not just bigger models or better prompting.

The Problem

Action recognition was once the defining challenge of video understanding. But as vision-language models grew more capable, the benchmarks didn't keep up. Existing datasets either use coarse labels — a single class for "rock climbing" rather than distinguishing between dozens of bouldering techniques — or focus on narrow perceptual questions like whether a ball rotates clockwise [§1]. Foundation models can achieve over 92% on these coarse benchmarks, with InternVideo2 reaching 92.1% on Kinetics-400 and 95.9% on ActivityNet [§2].

The harder, more practical problem is domain-specific action recognition: telling a "triple lutz" from a "triple axel," or a "warped sonic" from a "twisted sonic reverse" in pen spinning. These distinctions require both fine-grained perception (spotting a toe-pick contact) and compositional reasoning (verifying that all elements of a multi-part action occur in the right order) [§1]. But domain-specific data is notoriously hard to collect, so VLMs have largely gone untested on this front.

Prior datasets that attempted domain-specific coverage were limited. Ego-Exo4D covers only 8 domains with limited visual diversity — its 728 bouldering videos came from just 2 climbing gyms. ActionAtlas covers 56 sports but has a benchmark 5x smaller than VideoNet's and provides no training data [§2].

What They Did

The team built VideoNet, a benchmark spanning 1,000 distinct actions across 37 domains grouped into 7 categories: hobbies (skateboarding, pen spinning, yo-yo), dance (ballet, breaking, salsa), beauty (hairstyling, tattooing), food (bartending, cooking), crafts (crochet, pottery, calligraphy), medical (suturing, neurological assessments), and sports (figure skating, cricket, ice hockey) [§3.1, Figure 2].

Action taxonomies came from expert-written sources — skateboarding actions from respected skateboarding blogs, for instance — augmented by LLMs with web-search access to cross-check domain knowledge against reputable sources [§3.1]. Each action got a definition focused on visual cues and key differentiators from similar actions.

For data collection, they designed a three-stage human annotation pipeline where five distinct annotators review each clip [§3.2]. Annotators found clips online, then verified and trimmed them. Critically, they sidestepped the need for domain experts by reducing the classification problem: instead of asking a non-expert to classify a clip as one of many actions, they asked whether a clip contains a specific action, with the action's definition and example clips provided. Expert verification indicates near-97% label accuracy [§1].

The benchmark uses two evaluation formats. A multiple-choice setting presents four action options from the same domain. A binary setting simply asks "does this video show action X?" — where random chance is 50%. The binary setting also supports few-shot evaluation: models receive 1, 2, or 3 example videos of the target action before answering [Figure 1].

Beyond the benchmark, they collected a training dataset of 160,000 clips, which yields nearly 500,000 video question-answer pairs [§1].

The Results

The gap between proprietary and open models is dramatic. On multiple-choice, Gemini 3.1 Pro reaches 69.9% accuracy while the best open-weight 8B model, Qwen3-VL-8B, manages just 45.0% [Abstract]. On the binary setting — a simpler task where guessing randomly yields 50% — Qwen reaches only 59.2%, while non-expert humans score 69.1% [§1].
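
Because the two formats have different chance baselines, the raw accuracies are not directly comparable. One way to put them on the same footing is to rescale accuracy so that random guessing maps to 0 and a perfect score maps to 1. The sketch below applies that rescaling to the Qwen3-VL-8B figures quoted above. The rescaled metric and the function names are illustrative assumptions rather than something taken from the paper, and the calculation assumes uniform guessing over four options and balanced yes/no questions.

```python
def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing among num_options choices."""
    if num_options < 2:
        raise ValueError("need at least two options")
    return 1.0 / num_options


def above_chance(accuracy: float, chance: float) -> float:
    """Rescale accuracy so chance maps to 0.0 and a perfect score maps to 1.0."""
    return (accuracy - chance) / (1.0 - chance)


# Accuracies quoted in this article for Qwen3-VL-8B.
multiple_choice = above_chance(0.450, chance_accuracy(4))  # four options, 25% chance
binary = above_chance(0.592, chance_accuracy(2))           # yes/no, 50% chance

print(f"multiple-choice, rescaled: {multiple_choice:.3f}")  # about 0.267
print(f"binary, rescaled:          {binary:.3f}")           # about 0.184
```

On this scale both scores land in the same low range, which is why a 59.2% binary result is not meaningfully stronger than a 45.0% multiple-choice result.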

Few-shot examples help humans far more than models. When given three example videos of an action, non-expert human accuracy jumps by 13.6 percentage points. VLMs improve by an average of just 2.9 percentage points [§1]. The response to few-shot examples varies wildly across models: Qwen improves by 7.0 percentage points, but Gemini actually declines by 4.8 points [Abstract]. This suggests VLMs are poor in-context learners for visual actions — you can't reliably fix domain-specific gaps at inference time by showing examples.

Fine-tuning tells a different story. Training a Molmo2-4B model (half the size of Qwen3-VL-8B) on the VideoNet training data yields an 11.5 percentage point improvement, surpassing all open-weight 8B models and even some previous-generation proprietary models like GPT-4o and Gemini 2.5 [§1].

The benchmark covers 37 domains, which is broader than prior work but still a fraction of all specialized fields — and the training data pipeline, while avoiding domain experts, relies on web-sourced videos whose visual diversity and quality vary by domain. Extending this to domains with sparse online video (rare medical procedures, niche crafts) would require different collection strategies, likely adding significant cost and time.

Why It Matters

This work establishes that domain-specific action recognition is a concrete, measurable failure mode for current VLMs — not a theoretical concern. The finding that few-shot prompting barely helps (2.9 points average vs. 13.6 for humans) [§1] is particularly important for practitioners: if you're building a product that needs to distinguish between specific techniques in sports coaching, physical therapy, or skilled trades, prompt engineering with examples won't close the gap.

Fine-tuning on domain-specific data does work — a 4B model trained on VideoNet data beats 8B models that lack it [§1]. This is a lab proof-of-concept, but the dataset and code are public. For teams building specialized video understanding tools, the practical takeaway is clear: invest in domain-specific training data rather than waiting for general-purpose models to catch up.
