LLM Agents Break More on Paraphrases Than on Reformatting
Rewording a question changes an LLM agent's final answer roughly 20 percentage points more often than merely reformatting it — even when both perturbations are equally severe.
- What they did — Ran ten LLMs across three benchmarks and two agent scaffolds, perturbing 1,530 original questions with five operators (paraphrase, synonym, reorder, format, distractor) to produce ~11,150 variants, then validated on a held-out 11th model with 1,800 fresh trajectories.
- Key result — The severity-matched inconsistency rate gap between meaning-bearing and presentation perturbations averaged +19.69 percentage points (paired t = 9.58, p < 0.0001; 64 of 68 cells positive).
- Why it matters — Input normalization for deployed LLM agents should focus on paraphrase-class variation rather than formatting, since meaning-bearing perturbations propagate to answer changes at systematically higher rates.
That gap was measured under controlled laboratory conditions, and whether it generalizes across model families remains unproven.
The Problem
LLM agents that solve multi-step reasoning tasks increasingly receive inputs that have been paraphrased by upstream models, reordered by templating systems, or lightly reformatted. A practitioner building an input-normalization pipeline needs to know: does the agent care more about *what words you use* or *how you arrange them*?
Prior work has studied perturbation robustness mostly on single-step models. PromptBench showed that prompt rewrites degrade accuracy, but it didn't isolate a directional gap between meaning-bearing and presentation perturbations and didn't study multi-step agent trajectories [§2]. Sclar et al. (2024) demonstrated that formatting alone can shift accuracy by tens of points, but again on single predictions, not on agents with tool use and chain-of-thought reasoning [§2]. Whether agents treat these two perturbation classes differently, and whether that difference survives rigorous severity matching, hadn't been answered.
What They Did
The researchers built a 68-cell measurement grid. Each cell is a combination of one model, one benchmark, and one agent scaffold. They tested ten LLMs from seven architecture families — spanning open-weight models like Llama-3.1-8B and Gemma2-9B to commercial APIs like Gemini-2.5-Flash and DeepSeek-V3 — across three benchmarks (GSM8K for grade-school math, MATH for competition math, HotpotQA for multi-hop retrieval) using chain-of-thought and ReAct scaffolds [§1].
For each cell, they ran 20–50 original questions through five perturbation operators split into two classes. The "meaning-bearing" class includes paraphrase (full question rewrite) and synonym substitution (swapping content words). The "presentation" class includes reordering tokens or clauses, changing formatting like whitespace and casing, and inserting irrelevant distractor context [Table 1]. All five operators preserve the gold answer, so all of them are technically meaning-preserving rewrites. The classes differ in what they touch: meaning-bearing operators change the content words themselves, while presentation operators change how those words are arranged or what surrounds them [§3.1].
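To make the class split concrete, here is a minimal sketch of the three presentation operators, which need no language model. The function names, the specific casing and whitespace choices, and the sentence-level reorder rule are assumptions for illustration, not the paper's operators. Paraphrase and synonym substitution are left out because they depend on a generator model.

```python
import re

# Operator-to-class mapping as described in the text above.
OPERATOR_CLASS = {
    "paraphrase": "meaning-bearing",
    "synonym": "meaning-bearing",
    "reorder": "presentation",
    "format": "presentation",
    "distractor": "presentation",
}

def format_perturb(question: str) -> str:
    """Change only whitespace and casing (an assumed 'format' variant)."""
    widened = re.sub(r"\s+", "  ", question.strip())  # every gap becomes two spaces
    return widened.upper()

def distractor_perturb(question: str, distractor: str) -> str:
    """Prepend an irrelevant sentence and leave the question text untouched."""
    return f"{distractor} {question}"

def reorder_perturb(question: str) -> str:
    """Move the final sentence to the front (an assumed clause-level reorder)."""
    sentences = re.split(r"(?<=[.?!])\s+", question.strip())
    if len(sentences) < 2:
        return question  # nothing to reorder
    return " ".join([sentences[-1]] + sentences[:-1])
```

A real pipeline would also need to check that each variant still has the same gold answer, since uppercasing or reordering can change the meaning of some questions.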
The core metric is the inconsistency rate gap: for each cell, how much more often does the agent's final answer change under meaning-bearing perturbations versus presentation perturbations? [§3.2]
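The per-cell gap can be sketched as follows. This is an illustration under stated assumptions: the record layout is hypothetical, answers are compared by exact string equality (the paper relies on an LLM judge), and no severity matching is applied here.

```python
from collections import defaultdict

MEANING_BEARING = {"paraphrase", "synonym"}
PRESENTATION = {"reorder", "format", "distractor"}

def inconsistency_gap(records):
    """Return the meaning-minus-presentation gap, in percentage points, for one cell.

    Each record is a tuple (operator, original_answer, perturbed_answer).
    A variant counts as inconsistent when its final answer differs from
    the answer to the original question.
    """
    changed = defaultdict(int)
    total = defaultdict(int)
    for operator, original, perturbed in records:
        if operator in MEANING_BEARING:
            cls = "meaning"
        elif operator in PRESENTATION:
            cls = "presentation"
        else:
            raise ValueError(f"unknown operator: {operator}")
        total[cls] += 1
        changed[cls] += int(original != perturbed)  # assumed exact-match comparison
    if not total["meaning"] or not total["presentation"]:
        raise ValueError("need at least one variant from each class")
    rate = {c: 100.0 * changed[c] / total[c] for c in ("meaning", "presentation")}
    return rate["meaning"] - rate["presentation"]
```

A positive return value means the agent's answer changed more often under meaning-bearing perturbations than under presentation perturbations in that cell.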
Critically, the team controlled for the possibility that meaning-bearing perturbations are simply "bigger" edits. They matched perturbation severity using four independent proxies — edit distance, token-level Jaccard overlap, Sentence-BERT cosine similarity, and length-change ratio — and the gap persists under all four at p < 0.0001 [§4.1, Table 3].
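Three of the four severity proxies can be computed with the standard library alone. The sketch below is illustrative: whitespace tokenization, lowercasing, and the exact definition of the length-change ratio are assumptions, and the Sentence-BERT proxy is omitted because it requires an embedding model.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def jaccard_overlap(a: str, b: str) -> float:
    """Token-level Jaccard overlap on lowercased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty inputs are treated as identical
    return len(ta & tb) / len(ta | tb)

def length_change_ratio(original: str, variant: str) -> float:
    """Absolute change in token count, relative to the original length."""
    n = len(original.split())
    if n == 0:
        raise ValueError("original has no tokens")
    return abs(len(variant.split()) - n) / n
```

Severity matching would then compare meaning-bearing and presentation variants only when their proxy values are similar. How the paper forms those matches is not described here, so that step is left out.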
They then validated on a fully held-out 11th model (Qwen2.5-14B-Instruct) that was never seen during analysis design, running 200 originals per cell across 9 cells for 1,800 fresh trajectories [§1].
The Results
The severity-matched inconsistency rate gap across all 68 cells averages +19.69 percentage points (paired t = 9.58, p < 0.0001), with 64 of 68 cells showing a positive gap [§4.1]. Under the four different severity proxies, the gap ranges from +18.9 to +20.9 pp, all significant at p < 0.0001 [Table 3]. Removing the two Qwen-family models (which share a family with the perturbation generator) still yields a gap of +11.10 pp (p < 0.0001) on the remaining 48 cells [§4.1].
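For readers who want to see how a paired t statistic over cells is formed, here is a minimal sketch. It assumes one meaning-bearing rate and one presentation rate per cell, both in percentage points. The function name and input layout are hypothetical, and the paper's exact procedure may differ.

```python
import math
import statistics

def paired_t(meaning_rates, presentation_rates):
    """Return (mean gap, paired t statistic, number of positive cells)."""
    if len(meaning_rates) != len(presentation_rates):
        raise ValueError("each cell needs one rate from each class")
    diffs = [m - p for m, p in zip(meaning_rates, presentation_rates)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two cells")
    mean_gap = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the per-cell gaps
    if sd == 0:
        raise ValueError("t is undefined when all gaps are identical")
    t_stat = mean_gap / (sd / math.sqrt(n))
    positive = sum(d > 0 for d in diffs)
    return mean_gap, t_stat, positive
```

The statistic has n - 1 degrees of freedom. It treats cells as independent, which is the assumption the cluster-level tests discussed later in this section are meant to probe.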
The held-out validation reproduces the direction of the effect but at a far smaller magnitude: on "capable-and-tractable" cells, 3 of 4 show a positive gap with a mean of +1.08 pp; pooling across all partition groups, 13 of 16 cells are positive (Welch t = 3.81, p = 9.6 × 10⁻⁴) [§4.8].
Trace-level analysis reveals *where* the damage happens. Meaning-bearing perturbations leave the agent's first action intact but corrupt intermediate reasoning from step 2 onward: per-step similarity drops 5.6 to 10.5 pp relative to presentation perturbations, and the corruption cascades 0.17 steps deeper on average (paired t = 7.69, p = 2.5 × 10⁻¹⁴) [§4.9].
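A per-step comparison of this kind can be sketched as below. The similarity measure (character-level `difflib.SequenceMatcher` ratio), the 0.8 threshold, and the representation of a trace as a list of step strings are all assumptions for illustration; the summary does not say which similarity measure the paper uses.

```python
from difflib import SequenceMatcher

def per_step_similarity(original_steps, perturbed_steps):
    """Similarity in [0, 1] for each aligned pair of steps.

    Steps are compared position by position; trailing steps of the longer
    trace are ignored.
    """
    return [
        SequenceMatcher(None, a, b).ratio()
        for a, b in zip(original_steps, perturbed_steps)
    ]

def first_divergent_step(original_steps, perturbed_steps, threshold=0.8):
    """Return the 1-indexed step where similarity first drops below threshold.

    Returns None when no aligned step falls below the threshold.
    """
    sims = per_step_similarity(original_steps, perturbed_steps)
    for step, sim in enumerate(sims, start=1):
        if sim < threshold:
            return step
    return None
```

Under the pattern described above, a check like this would typically report step 2 or later for meaning-bearing variants, which is why inspecting only the first step would miss the divergence.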
The paper is unusually transparent about what doesn't work. Wild cluster bootstrap — a more conservative statistical test appropriate for small cluster counts — is non-significant at both K=10 (model clusters) and K=7 (family clusters) [§4.2]. This means the gap is real at the cell level but cannot yet be confidently attributed to a model-level effect rather than question-level variation — a meaningful caveat for anyone trying to generalize across model families. Cross-family generator swaps destroy per-cell rankings (Spearman ρ = +0.14), so the exact magnitude depends on which system generates the perturbations [§4.5]. A second LLM judge agrees with the primary judge at only Cohen's κ = 0.50 — moderate agreement, not strong [§4.6]. Two mechanism hypotheses proposed in earlier drafts failed to replicate on the held-out model and are explicitly retracted [§4.9].
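For context on the judge-agreement figure, Cohen's κ corrects raw agreement for the agreement expected by chance. A minimal sketch follows, assuming each judge emits one categorical label per item (for example, whether the answer changed); the function name and label encoding are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)
```

As a small worked case, two judges who label four items `[1, 1, 0, 0]` and `[1, 0, 0, 0]` agree on 75% of items, expected chance agreement is 50%, and κ comes out at 0.5.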
Why It Matters
This is a lab measurement, not a production tool. The benchmarks cover controlled academic tasks, not messy real-world agent deployments with mixed perturbation types and human edits. The wild cluster bootstrap failure means the finding's generalizability across arbitrary new model families remains unproven — reliable cross-family claims likely require studies with 30+ independent model clusters, which could take years to accumulate.
But the directional signal is consistent enough to inform engineering decisions today. If you're building input normalization for an agent pipeline, this work suggests that guarding against paraphrase-class variation — upstream model rewrites, user rephrasings — matters more than standardizing formatting [§1]. The stealth-divergence pattern, where the first agent step looks fine but intermediate reasoning silently corrupts, also suggests that monitoring only the first action of an agent trace is insufficient for catching perturbation-induced failures [§4.9].