Signal #9 — 2026-W20

A Single Misleading Document Can Tank Long-Context AI Performance

Muhan Gao, Zih-Ching Chen, Kuan-Hao Huang

Just 10% hard distractors in a 128K-token context cause most of the accuracy damage — adding more barely makes things worse.

What they did — Researchers systematically varied the proportion of misleading-but-topically-relevant documents in 128K-token contexts and measured accuracy across three LLMs and four QA datasets.
Key result — Introducing just 10% hard distractors caused accuracy to drop steeply, while increasing distractors from 10% to 100% produced only marginal additional decline. In the accompanying attention analysis, the gold document's share of attention fell from ~97% to ~21% at that same 10% mark.
Why it matters — For teams building RAG or agentic systems, this means post-retrieval filtering is far less valuable than preventing misleading documents from entering the context in the first place.

The Problem

Modern AI systems — deep research pipelines, RAG applications, multi-agent frameworks — routinely stuff 100K+ tokens of retrieved documents into a single context window before generating a response [§1]. The implicit assumption is that more documents means better answers, and that if some retrieved passages are unhelpful, you can filter them out afterward.

But what happens when some of those documents are topically relevant yet misleading? Prior work has shown that such "hard distractors" degrade performance [§2], and that longer contexts amplify the problem [§1]. What nobody had measured was the *shape* of that degradation. Does accuracy decline linearly as you add more misleading documents? Or does something more dangerous happen?

This matters because real retrieval systems don't return cleanly labeled "good" and "bad" documents. They return a mix, and the hard distractors — passages that look relevant but lead the model astray — are precisely the ones that are hardest to filter out [§3.1].

What They Did

The researchers constructed controlled experiments across four question-answering datasets (Natural Questions, TriviaQA, PopQA, and HotpotQA) and three models (Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, and Qwen3-Next-80B-Instruct) [§3.1]. For each experiment, they fixed the total context length at 128K tokens and varied only the *proportion* of hard distractors from 0% to 100%.

Hard distractors were Wikipedia passages retrieved via BM25 that are topically related to the query but don't contain the answer. Each was verified by GPT-4o-mini to ensure it didn't inadvertently include the correct answer in any form [§3.1]. The remaining context was filled with either easy distractors (repetitive filler sentences) or random Wikipedia passages, giving two mixing conditions per dataset.

The gold passage — the one containing the actual answer — was always present. All passages were randomly shuffled to avoid positional bias [§3.1]. This setup isolates a single variable: what fraction of the surrounding context is semantically misleading.

To explain the mechanism, the authors analyzed how the softmax attention function distributes the model's focus. Think of softmax as a competition: every document in the context is bidding for the model's attention, and the scores get normalized so they sum to 1. In the example shown, hard distractors score nearly as high as the gold document before normalization (logits of ~8-9 versus ~9 for gold, compared to ~1 for easy distractors) [Figure 1]. Even a few high-scoring competitors dramatically dilute the gold document's share of attention. The authors prove mathematically that attention on the gold document is a strictly convex function of hard distractor proportion [§4]. Because that attention also falls as hard distractors are added, convexity means the steepest drop happens at the beginning.
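The convexity claim can be illustrated with a simplified one-query model. This is a sketch under stated assumptions, not the paper's proof: one gold document with logit \ell_g, N distractors of which a fraction p are hard (logit \ell_h) and the rest easy (logit \ell_e), with \ell_h > \ell_e. The symbols a(p), c, \alpha, and \beta are introduced here for illustration only.

```latex
a(p) = \frac{e^{\ell_g}}{e^{\ell_g} + N\left[p\,e^{\ell_h} + (1-p)\,e^{\ell_e}\right]}
     = \frac{c}{\alpha + \beta p},
\qquad
c = e^{\ell_g},\quad
\alpha = e^{\ell_g} + N e^{\ell_e},\quad
\beta = N\left(e^{\ell_h} - e^{\ell_e}\right) > 0 .

a'(p) = -\frac{c\,\beta}{(\alpha + \beta p)^{2}} < 0,
\qquad
a''(p) = \frac{2\,c\,\beta^{2}}{(\alpha + \beta p)^{3}} > 0
\quad \text{for } p \in [0, 1].
```

In this toy model the gold share is strictly decreasing and strictly convex in p, so the slope is largest in magnitude at p = 0: the first hard distractors cost the most.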

The Results

The pattern is consistent across all models and datasets: introducing the first 10% of hard distractors causes steep performance degradation, while increasing from 10% to 100% yields only marginal additional decline [Figure 2]. In the attention analysis, with 100 distractor documents, the gold document's attention share drops from 97% to 21% when just 10% of distractors are switched from easy to hard [Figure 1].
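The attention figures above can be reproduced with a few lines of arithmetic. The sketch below assumes a single softmax over document-level logits, with illustrative values of 9 for the gold document, 8 for hard distractors, and 1 for easy distractors. These are picked from the approximate ranges quoted from Figure 1 and are not the paper's exact numbers. The function name `gold_attention_share` is hypothetical.

```python
import math


def gold_attention_share(n_distractors, hard_fraction,
                         gold_logit=9.0, hard_logit=8.0, easy_logit=1.0):
    """Softmax share of one gold document among easy and hard distractors.

    All logit defaults are illustrative assumptions, not measured values.
    """
    n_hard = round(n_distractors * hard_fraction)
    n_easy = n_distractors - n_hard
    gold = math.exp(gold_logit)
    total = gold + n_hard * math.exp(hard_logit) + n_easy * math.exp(easy_logit)
    return gold / total


# Sweep the hard-distractor proportion with 100 distractor documents.
shares = {p: gold_attention_share(100, p) for p in (0.0, 0.1, 0.5, 1.0)}
for p, share in shares.items():
    print(f"hard fraction {p:.0%}: gold attention share {share:.1%}")
```

Under these assumed logits the gold share is roughly 97% with no hard distractors and roughly 21% at 10%, and the remaining 90 points of hard-distractor proportion remove far less than the first 10 did.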

On the QA accuracy side, the nonlinear curve appears regardless of model size (8B to 80B parameters), dataset type (single-hop and multi-hop), or mixing condition (easy-hard or random-hard) [§3.2]. HotpotQA shows the lowest baseline accuracy due to its multi-hop reasoning requirement, but the *shape* of the degradation curve is the same [Figure 2].

The filtering experiments are equally striking. When the researchers simulated removing distractors from the context, they found that the accuracy gains came primarily from reducing context length, not from removing hard distractors specifically [§5]. Substantial performance recovery required reducing the hard distractor proportion to near zero — partial removal barely helped [§1].

There are important scope limitations. The experiments use a controlled needle-in-a-haystack setup with single gold passages and synthetically constructed distractor mixes [§3.1]. Real-world retrieval produces messier distributions where documents fall on a spectrum of relevance rather than into clean categories. The study covers three models up to 80B parameters — behavior at frontier model scale (400B+) isn't tested. For production RAG systems with complex, multi-document reasoning chains, the precise thresholds will likely differ, though the underlying attention mechanics apply broadly.

Why It Matters

This work reframes a core assumption in RAG and agentic system design. The standard playbook — retrieve broadly, then filter — is undermined by the convex shape of the degradation curve. Because most damage happens in the first small fraction of hard distractors, and because filtering rarely achieves near-zero contamination, post-hoc cleanup provides far less value than expected [§5].

The practical implication is concrete: investment should shift toward upstream retrieval precision — ensuring misleading documents never enter the context — rather than downstream filtering [§1]. This is a lab-validated finding across controlled conditions, not a production-ready prescription. But for teams architecting RAG pipelines or agentic systems that accumulate large contexts, the "first drop of ink" pattern suggests that retrieval quality thresholds need to be much higher than current practice assumes. A 90%-clean context isn't 90% as good as a perfect one — it may already have absorbed most of the possible damage.

Best AI Agents Fail 38% of Real-World Tasks

Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Yang JingYi, Penghui Yang, Zhixiong Zhang, Xilin Wei, Xinyu Fang, Yubo Ma, Haodong Duan, Jing Shao, Jiaqi Wang, Dahua Lin, Kai Chen, Yuhang Zang

The best frontier AI model completes only 62.2% of realistic, long-horizon tasks — and simply switching the tool harness it runs in can swing performance by up to 18 percentage points.

What they did — Researchers built WildClawBench, a benchmark of 60 human-authored tasks that run inside real CLI agent harnesses (not mock APIs) in Docker containers, averaging ~8 minutes and 20+ tool calls per task, and tested 19 frontier models across multiple harnesses.
Key result — The top model, Claude Opus 4.7, scored 62.2% overall under the OpenClaw harness; every other model stayed below 60%, and switching harness alone shifted a single model's score by up to 18 points.
Why it matters — For teams deploying AI agents on complex workflows, these results show that model choice alone is insufficient — the runtime scaffold and tool configuration are critical variables that current evaluation practices largely ignore.

The Problem

AI agents are increasingly deployed through command-line interface harnesses — tools like Claude Code, OpenClaw, and Hermes Agent — that let models plan, invoke external tools, and adapt across multi-step workflows [§1]. But the benchmarks used to evaluate these agents don't match how they're actually used.

Most existing benchmarks suffer from a combination of four problems: they run in synthetic sandboxes rather than real runtimes, they test short tasks that finish in under a minute, they use mock APIs instead of real tools, and they only check whether the final answer is correct without auditing how the agent got there [§1]. As the authors put it, "evaluation captures whether the final answer is right but not how the runtime was actually used to produce it" [§1].

This matters because agents in production don't just answer questions — they send emails, modify file systems, run shell commands, and orchestrate across applications. If a benchmark doesn't test those behaviors in realistic conditions, high scores may not predict real-world reliability.

What They Did

WildClawBench is a suite of 60 human-authored tasks that run inside Docker containers hosting actual CLI agent harnesses — OpenClaw, Claude Code, Codex, or Hermes Agent — with access to real tools like shells, web browsers, file systems, and email clients rather than mock-service APIs [§1, §3.1].

The tasks span six categories: productivity flow (e.g., fetching and classifying arXiv papers into a digest), code intelligence (e.g., reading an undocumented codebase and writing inference scripts), social interaction (e.g., negotiating meeting schedules via email), search and retrieval, creative synthesis (e.g., compiling goal highlights from a full soccer match video), and safety alignment (e.g., recognizing prompt injection attacks embedded in documents) [§3.1, Figure 2]. Twenty-six of the 60 tasks are natively multimodal, and the benchmark is bilingual (English and Chinese) [Abstract].

These aren't quick tasks. Each one runs under a time budget of 300 to 1,200 seconds, and in practice averages roughly 8 minutes of wall-clock time and over 20 tool calls per run [Abstract, §1]. That's a significant step up from benchmarks where tasks finish in under a minute.

To prevent gaming, ground-truth data and grading resources are mounted into the container only after the agent process exits [§3.1]. Grading uses a hybrid approach: deterministic rule-based checks on produced artifacts (does the file exist? does it match the required schema?), environment-state auditing of side effects (was the calendar event created? were emails sent?), and an LLM/VLM judge for semantic checks that rules alone can't resolve [§1, §3.1]. All models are accessed through a unified API endpoint, and tool schemas and system prompts are held constant within each harness to isolate model behavior [§1].
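A hybrid grader of this kind might be structured roughly as follows. This is a minimal sketch, not WildClawBench's actual grading code: the function `grade_task`, the JSON artifact, the `sent.log` file format, and the `judge` callable are all hypothetical stand-ins for the three layers described above.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


def grade_task(artifact: Path, required_keys: set, sent_log: Path,
               expected_recipient: str,
               judge: Optional[Callable[[str], float]] = None) -> dict:
    """Combine rule checks, side-effect auditing, and an optional judge."""
    result = {"artifact_ok": False, "side_effect_ok": False, "judge_score": None}

    # 1. Deterministic rule check: the file exists and has the required keys.
    if artifact.is_file():
        try:
            data = json.loads(artifact.read_text())
            result["artifact_ok"] = (isinstance(data, dict)
                                     and required_keys.issubset(data))
        except json.JSONDecodeError:
            pass

    # 2. Environment-state audit: was a message sent to the right recipient?
    if sent_log.is_file():
        lines = sent_log.read_text().splitlines()
        result["side_effect_ok"] = any(expected_recipient in ln for ln in lines)

    # 3. Semantic check, delegated to a judge (an LLM call in practice).
    if judge is not None and result["artifact_ok"]:
        result["judge_score"] = judge(artifact.read_text())
    return result


# Toy run against a temporary directory standing in for the container.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "digest.json").write_text(
        json.dumps({"title": "Weekly digest", "papers": []}))
    (root / "sent.log").write_text("to=alice@example.com subject=Digest\n")
    result = grade_task(
        artifact=root / "digest.json",
        required_keys={"title", "papers"},
        sent_log=root / "sent.log",
        expected_recipient="alice@example.com",
        judge=lambda text: 1.0 if "digest" in text.lower() else 0.0,
    )
print(result)
```

Keeping the judge as an injected callable means the deterministic layers can be tested without any model call, and the semantic layer only runs once the cheap checks pass.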

The Results

Across 19 frontier models — 6 proprietary and 13 open-source — the benchmark is far from saturated [§1]. Under the OpenClaw harness, Claude Opus 4.7 leads at 62.2%. Every other model stays below 60%, and scores span a 43-point range down to 19.3% [§1].

Multimodal tasks are substantially harder than text-only ones. GPT 5.4 scores 40.2% on multimodal workflows versus 58.0% on pure-text tasks; Claude Opus 4.7 shows a similar gap at 58.5% versus 65.0% [§1].

Perhaps the most striking finding: the harness matters enormously. Switching harness alone — same model, same tasks — can shift a model's score by up to 18 points (observed with MiMo V2 Pro between Claude Code and Hermes Agent) [§1]. Performance also varies with time budget and available skills [§1]. This supports the paper's argument that "the scaffold, tool usage, trajectory, and produced artifacts are part of the evaluated system rather than incidental implementation details" [§1].

The benchmark has clear scope limitations. It contains 60 tasks — enough to reveal broad patterns but not enough to provide fine-grained statistical power across all six categories. The tasks were human-authored for this benchmark specifically, meaning they reflect the authors' judgment about what constitutes realistic work rather than being sampled from actual production usage. For organizations wanting to evaluate agents on their specific workflows, WildClawBench is a starting framework, not a drop-in production test suite — custom task authoring would likely be needed for domain-specific deployment decisions.

Why It Matters

This work sits at the lab-benchmark stage — it's a proof-of-concept evaluation framework, not a production monitoring tool. But it surfaces findings that matter now for anyone deploying AI agents.

The harness sensitivity result is the most actionable. If switching from one CLI tool to another can move a model's score by 18 points, then teams evaluating agents by testing a model in one harness and deploying it in another are making decisions on unreliable data [§1]. The multimodal gap — consistent across models — suggests that workflows involving images, video, or mixed media remain meaningfully harder and should be tested separately rather than assumed to track text-only performance [§1].

The 62.2% ceiling for the best model on tasks averaging 8 minutes and 20+ tool calls [Abstract] provides a concrete reference point: even the strongest frontier model tested fails nearly four out of ten realistic multi-step tasks, and every other model fails more. For teams building agent-powered workflows, this suggests that human oversight and fallback mechanisms remain essential for any task where reliability matters. The benchmark code, task specifications, and containerized tooling are publicly released [Abstract].

Adaptive LoRA Fine-Tuning That Discovers Its Own Rank

Barbara Su, Fangshuo Liao, Anastasios Kyrillidis

A parallel training scheme for LoRA adapters automatically finds per-layer ranks during fine-tuning, producing adapters 30.7% smaller than fixed-rank LoRA with matched accuracy.

What they did — The researchers built AdaPaD, a parallel deflation algorithm that decomposes LoRA's low-rank update into independent rank-1 components trained simultaneously, with a mechanism that automatically grows rank per module under a shared parameter budget.
Key result — On Qwen3-0.6B SQuAD/SQuAD v2, AdaPaD matches fixed-rank LoRA on F1/EM while using adapters that are on average 30.7% smaller; on GLUE with DeBERTaV3-base, it is competitive with four adaptive-rank baselines at matched parameter budgets of 0.32–0.47M.
Why it matters — If validated at larger scale, this approach could eliminate per-layer rank tuning for LoRA, reducing both hyperparameter search cost and adapter size.

The Problem

LoRA fine-tuning works by freezing a pretrained model's weights and training a small low-rank update matrix for each layer. The rank of that matrix — how expressive the update can be — is a hyperparameter you pick before training starts. In practice, people set the same rank everywhere, even though different layers need different amounts of flexibility depending on the task [§1].
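For readers who have not seen LoRA written out, the sketch below shows the standard form of the update in NumPy: the frozen weight `W` plus a product of two thin matrices `B` and `A` of rank `r`. The function names and the dimensions are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def lora_forward(x, W, A, B, scale=1.0):
    """Forward pass with a frozen weight W and a trainable low-rank update.

    W: (d_out, d_in) frozen. A: (r, d_in) and B: (d_out, r) are trainable.
    The effective weight is W + scale * B @ A, never materialised here.
    """
    return x @ W.T + scale * (x @ A.T) @ B.T


def lora_param_count(d_in, d_out, r):
    """Trainable parameters in one rank-r adapter."""
    return r * (d_in + d_out)


rng = np.random.default_rng(0)
d_in, d_out, r = 1024, 1024, 8
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))  # common init: the update starts at zero
x = rng.standard_normal((2, d_in))

y = lora_forward(x, W, A, B)
print(y.shape, lora_param_count(d_in, d_out, r), d_in * d_out)
```

The parameter count grows linearly in `r`, which is why the choice of rank per layer directly controls adapter size.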

Several methods try to solve this. Some start high and prune down (AdaLoRA), others start low and grow (IncreLoRA), and still others learn scaling factors jointly. But there's a deeper structural problem with how many of these methods extract their low-rank components. Sequential deflation — peeling off one rank-1 component at a time — permanently freezes errors from early components into every subsequent one. Prior work [Vandchali et al., 2025] shows these errors compound multiplicatively through the spectral gap [§1]. Joint optimization methods avoid this but can only certify that the overall sum is correct, not that individual rank-1 directions are meaningful [§1].

The question is whether you can get the best of both worlds: decompose the problem into independent rank-1 pieces (so each direction is individually certifiable) while avoiding the error-freezing problem of doing them one at a time.

What They Did

AdaPaD (Adaptive Parallel Deflation) trains all rank-1 components of a LoRA update simultaneously rather than sequentially. Think of it like a team of specialists each responsible for learning one "direction" of the weight update. At regular intervals, each specialist checks what all the specialists before it have learned so far, subtracts those contributions from the training signal, and refines its own piece against the updated residual [§2, Table 1].

The key property the authors call "self-correction" is what makes this different from sequential approaches. In sequential deflation, if specialist #1 makes a mistake, that mistake is baked into the training signal for specialist #2, #3, and so on — permanently. In AdaPaD's parallel scheme, specialist #1 keeps improving over rounds, so the training signals for later specialists get cleaner over time. The authors prove that this deflation mismatch — the gap between what each specialist targets and what it should ideally target — converges to zero as training progresses [Proposition 2, §1].
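The self-correction idea can be seen on a toy problem that is much simpler than fine-tuning. The sketch below is not AdaPaD itself. It assumes a fixed target matrix `M` in place of a training signal and uses power iteration as the rank-1 subroutine, and the names `rank1_steps` and `parallel_deflation` are hypothetical. Each component is refined every round against a residual built from the previous round's estimates of the components before it, so early mistakes are not frozen in.

```python
import numpy as np


def rank1_steps(R, u, steps):
    """A few alternating power-iteration steps on R, warm-started from u."""
    for _ in range(steps):
        v = R.T @ u
        v = v / (np.linalg.norm(v) + 1e-12)
        u = R @ v
        sigma = np.linalg.norm(u)
        u = u / (sigma + 1e-12)
    return sigma, u, v


def parallel_deflation(M, K, rounds=60, steps=5, seed=0):
    """Estimate the top-K singular triplets of M, all components at once."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = rng.standard_normal((m, K))
    U /= np.linalg.norm(U, axis=0)
    V = np.zeros((n, K))
    S = np.zeros(K)
    for _ in range(rounds):
        # Snapshot: every component reads last round's estimates, so the
        # K updates below are independent and could run on separate workers.
        S_old, U_old, V_old = S.copy(), U.copy(), V.copy()
        for k in range(K):
            # Subtract what components 0..k-1 currently claim to explain.
            R = M - (U_old[:, :k] * S_old[:k]) @ V_old[:, :k].T
            S[k], U[:, k], V[:, k] = rank1_steps(R, U_old[:, k], steps)
    return S, U, V


# Build a test matrix with known singular values 5, 3, 1.5, 0.5.
rng = np.random.default_rng(1)
Qu, _ = np.linalg.qr(rng.standard_normal((8, 4)))
Qv, _ = np.linalg.qr(rng.standard_normal((6, 4)))
true_s = np.array([5.0, 3.0, 1.5, 0.5])
M = (Qu * true_s) @ Qv.T

S, U, V = parallel_deflation(M, K=3)
print(np.round(S, 6))
```

In the first round every component chases the dominant direction because nothing has been subtracted yet. As component 0 settles, the residuals seen by later components become accurate and they move to their own directions, which is the behaviour a sequential scheme with frozen early components cannot reproduce.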

Two practical additions sit on top of this parallel backbone. First, "advance learning": before a new rank-1 component is officially activated, it gets private warm-up training against the current residual. This means it doesn't start from scratch when it joins the active set [§3]. Second, per-module dynamic rank discovery: instead of assigning the same rank to every layer, AdaPaD grows each module's rank based on an importance score, allocating from a shared global parameter budget until it's exhausted. The rank distribution becomes something the algorithm discovers rather than something you specify [§1, C1].
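Budgeted rank growth can be sketched as a greedy loop. The example below is an assumption-heavy illustration, not the paper's procedure: the summary does not specify the importance score, so `importance` here is a made-up callable that decays as a module's rank grows, and `allocate_ranks` and the module shapes are hypothetical.

```python
import heapq


def allocate_ranks(modules, budget, importance):
    """Greedily grow per-module rank under a shared parameter budget.

    modules: dict name -> (d_in, d_out)
    importance(name, rank): score for adding one more rank-1 component
    (a placeholder; the real scoring rule is not given in the summary).
    """
    ranks = {name: 0 for name in modules}
    spent = 0
    heap = [(-importance(name, 0), name) for name in modules]
    heapq.heapify(heap)
    while heap:
        _, name = heapq.heappop(heap)
        d_in, d_out = modules[name]
        cost = d_in + d_out  # parameters in one rank-1 component
        if spent + cost > budget:
            continue  # this module no longer fits; try the remaining ones
        ranks[name] += 1
        spent += cost
        heapq.heappush(heap, (-importance(name, ranks[name]), name))
    return ranks, spent


# Illustrative shapes and base scores, chosen only for the demo.
modules = {"q_proj": (1024, 1024), "v_proj": (1024, 1024), "mlp_up": (1024, 4096)}
base = {"q_proj": 1.0, "v_proj": 3.0, "mlp_up": 2.0}
budget = 20000

ranks, spent = allocate_ranks(
    modules, budget, importance=lambda name, r: base[name] / (r + 1))
print(ranks, spent)
```

With these made-up scores the higher-scoring module ends up with more rank than the lower-scoring one, and total spend stays within the budget. Swapping in a different scoring rule changes the distribution without touching the loop.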

The theoretical contribution is an exponential convergence guarantee: after a warm-up period, each component's error decays exponentially, and the overall weight recovery error splits into a vanishing algorithmic term plus an irreducible noise floor [Theorems 1, 3, 4, §1 C2].

The Results

On GLUE benchmarks using DeBERTaV3-base at matched parameter budgets (0.32–0.47M parameters), AdaPaD is competitive with four adaptive-rank baselines (IncreLoRA, AdaLoRA, SoRA, and dEBORA) across all eight tasks [§1, C3]. The paper frames this as parity rather than dominance: no single method wins everywhere.

The more striking result is on Qwen3-0.6B fine-tuned for SQuAD and SQuAD v2: AdaPaD matches fixed-rank LoRA on F1 and exact match scores while deploying adapters that are on average 30.7% smaller [§1, C3]. The size reduction comes from the rank discovery mechanism allocating fewer parameters to layers that don't need them.

For multi-GPU training, component-parallel sharding across 4×H200 GPUs delivers up to 2.66× wall-clock speedup [§5.3], since each rank-1 component can be trained on a separate worker.

The limitations are significant. All experiments use models at or below 0.6B parameters — DeBERTaV3-base and Qwen3-0.6B [§5.1, §5.2]. Whether the self-correction property and rank discovery mechanism scale to 7B+ models with dozens of adapter modules is untested. For teams working at that scale, this means the approach needs further validation before it could replace existing rank-selection heuristics. The theoretical guarantees also rest on assumptions including distinct singular values with nonzero gaps [Assumption 2, §2.1] and a contraction property of the rank-1 subroutine [Assumption 1, §2.1] — conditions that are standard in the optimization literature but whose practical tightness in real fine-tuning settings is not empirically verified.

Why It Matters

This is a lab proof-of-concept that addresses a real pain point: LoRA rank selection. The 30.7% adapter size reduction at matched accuracy on SQuAD [§5.2] is the most practically relevant finding — smaller adapters mean lower serving costs when deploying many task-specific adapters. The self-correction guarantee [Proposition 2] is theoretically distinctive because it certifies individual rank-1 directions rather than just their sum, which could matter for interpretability of what each component learns.

If your team currently does grid search over LoRA ranks or applies the same rank everywhere, this work suggests that automatic per-layer rank allocation is feasible with convergence guarantees. But the small model scale tested means this is something to monitor, not adopt — the gap between 0.6B-parameter experiments and production-scale fine-tuning at 7B+ remains wide.
