Signal

Issue #11 · 2026-W22


This week in Signal

LLM Agents Break More on Paraphrases Than on Reformatting

Liyun Zhang, Jiayi Guo

Rewording a question changes an LLM agent's final answer roughly 20 percentage points more often than merely reformatting it, even when both perturbations are equally severe. The result comes from controlled laboratory conditions, and whether the gap generalizes across model families remains unproven.

The Problem

LLM agents that solve multi-step reasoning tasks increasingly receive inputs that have been paraphrased by upstream models, reordered by templating systems, or lightly reformatted. A practitioner building an input-normalization pipeline needs to know: does the agent care more about *what words you use* or *how you arrange them*?

Prior work has studied perturbation robustness mostly on single-step models. PromptBench showed that prompt rewrites degrade accuracy but didn't separate a directional gap between meaning-bearing and presentation perturbations, and didn't study multi-step agent trajectories [§2]. Sclar et al. (2024) demonstrated that formatting alone can shift accuracy by tens of points, but again on single predictions, not on agents with tool use and chain-of-thought reasoning [§2]. The question of whether agents treat these two perturbation classes differently — and whether that difference survives rigorous severity matching — hadn't been answered.

What They Did

The researchers built a 68-cell measurement grid. Each cell is a combination of one model, one benchmark, and one agent scaffold. They tested ten LLMs from seven architecture families — spanning open-weight models like Llama-3.1-8B and Gemma2-9B to commercial APIs like Gemini-2.5-Flash and DeepSeek-V3 — across three benchmarks (GSM8K for grade-school math, MATH for competition math, HotpotQA for multi-hop retrieval) using chain-of-thought and ReAct scaffolds [§1].

For each cell, they ran 20–50 original questions through five perturbation operators split into two classes. The "meaning-bearing" class includes paraphrase (full question rewrite) and synonym substitution (swapping content words). The "presentation" class includes reordering tokens or clauses, changing formatting like whitespace and casing, and inserting irrelevant distractor context [Table 1]. All five operators preserve the gold answer — they're all technically meaning-preserving rewrites, but they target different parts of the input [§3.1].

The core metric is the inconsistency rate gap: for each cell, how much more often does the agent's final answer change under meaning-bearing perturbations versus presentation perturbations? [§3.2]
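A minimal sketch of how such a per-cell gap could be computed. The exact aggregation is not spelled out here, so two things are assumptions: answers are compared by exact string match, and each class rate is the unweighted mean over its operators. The operator names and toy data are hypothetical.

```python
from statistics import mean

# Assumed grouping of the five operators into the two classes described above.
MEANING_OPS = ("paraphrase", "synonym")
PRESENTATION_OPS = ("reorder", "format", "distractor")


def inconsistency_rate(original_answers, perturbed_answers):
    """Fraction of questions whose final answer changed after perturbation.

    Assumption: exact string comparison; a judge model or answer
    normalization could be substituted here.
    """
    if len(original_answers) != len(perturbed_answers):
        raise ValueError("answer lists must be aligned one-to-one")
    if not original_answers:
        raise ValueError("need at least one question")
    changed = sum(o != p for o, p in zip(original_answers, perturbed_answers))
    return changed / len(original_answers)


def cell_gap_pp(original_answers, answers_by_operator):
    """Meaning-bearing minus presentation inconsistency, in percentage points."""
    meaning = mean(
        inconsistency_rate(original_answers, answers_by_operator[op])
        for op in MEANING_OPS
    )
    presentation = mean(
        inconsistency_rate(original_answers, answers_by_operator[op])
        for op in PRESENTATION_OPS
    )
    return 100.0 * (meaning - presentation)


# Hypothetical toy cell with four questions (not data from the paper).
original = ["4", "7", "9", "12"]
perturbed = {
    "paraphrase": ["4", "8", "9", "13"],   # 2 of 4 answers changed
    "synonym": ["4", "7", "10", "12"],     # 1 of 4 changed
    "reorder": ["4", "7", "9", "12"],      # none changed
    "format": ["4", "7", "9", "11"],       # 1 of 4 changed
    "distractor": ["4", "7", "9", "12"],   # none changed
}
gap = cell_gap_pp(original, perturbed)
print(f"gap = {gap:.2f} pp")
```

A positive value means the agent's answers moved more under meaning-bearing perturbations than under presentation perturbations for that cell.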

Critically, the team controlled for the possibility that meaning-bearing perturbations are simply "bigger" edits. They matched perturbation severity using four independent proxies — edit distance, token-level Jaccard overlap, Sentence-BERT cosine similarity, and length-change ratio — and the gap persists under all four at p < 0.0001 [§4.1, Table 3].
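Three of the four severity proxies can be sketched with the standard library alone. The definitions below are common textbook forms and are assumptions about the details (character-level edit distance, whitespace tokenization, length measured in characters). The Sentence-BERT proxy is omitted because it needs an embedding model.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercase whitespace-token sets (1.0 = identical sets)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def length_change_ratio(a: str, b: str) -> float:
    """Absolute change in character length relative to the original."""
    return abs(len(b) - len(a)) / max(len(a), 1)


# Hypothetical original question and paraphrase.
q = "How many apples does Tom have left?"
q_para = "What number of apples remain with Tom?"
print(edit_distance(q, q_para), token_jaccard(q, q_para), length_change_ratio(q, q_para))
```

Severity matching would then compare meaning-bearing and presentation perturbations only within similar ranges of one of these scores, so that a larger gap cannot be explained by larger edits alone.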

They then validated on a fully held-out 11th model (Qwen2.5-14B-Instruct) that was never seen during analysis design, running 200 originals per cell across 9 cells for 1,800 fresh trajectories [§1].

The Results

The severity-matched inconsistency rate gap across all 68 cells averages +19.69 percentage points (paired t = 9.58, p < 0.0001), with 64 of 68 cells showing a positive gap [§4.1]. Under the four different severity proxies, the gap ranges from +18.9 to +20.9 pp, all significant at p < 0.0001 [Table 3]. Removing the two Qwen-family models (which share a family with the perturbation generator) still yields a gap of +11.10 pp (p < 0.0001) on the remaining 48 cells [§4.1].
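For readers who want to see what a paired t statistic over cells involves, here is a minimal sketch. It assumes one meaning-bearing rate and one presentation rate per cell; the three toy cells are hypothetical, not figures from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(meaning_rates, presentation_rates):
    """Return (mean difference, t statistic) for per-cell paired rates.

    t = mean(d) / (sd(d) / sqrt(n)), where d is the per-cell difference.
    Raises ZeroDivisionError if every difference is identical.
    """
    if len(meaning_rates) != len(presentation_rates):
        raise ValueError("rates must be paired cell by cell")
    diffs = [m - p for m, p in zip(meaning_rates, presentation_rates)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two cells")
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs), mean(diffs) / standard_error


# Hypothetical inconsistency rates for three cells.
mean_gap, t_stat = paired_t([0.3, 0.5, 0.6], [0.2, 0.3, 0.3])
print(f"mean gap = {mean_gap:.3f}, t = {t_stat:.3f}")
```

As a rough consistency check, if the reported t = 9.58 was computed over the 68 per-cell gaps with mean 19.69 pp, the implied standard deviation of those gaps is about 19.69 × √68 / 9.58 ≈ 17 pp.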

The held-out validation confirms the pattern's structure: on "capable-and-tractable" cells, 3 of 4 show a positive gap with a mean of +1.08 pp; pooling across all partition groups, 13 of 16 cells are positive (Welch t = 3.81, p = 9.6 × 10⁻⁴) [§4.8].

Trace-level analysis reveals *where* the damage happens. Meaning-bearing perturbations leave the agent's first action intact but corrupt intermediate reasoning from step 2 onward, with per-step similarity dropping 5.6 to 10.5 pp relative to presentation perturbations, and the corruption cascading 0.17 steps deeper on average (paired t = 7.69, p = 2.5 × 10⁻¹⁴) [§4.9].
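One way to picture a per-step comparison is to align the original and perturbed traces step by step and find the first step whose similarity falls below a threshold. The similarity measure used in the paper is not stated here, so the sketch below substitutes `difflib.SequenceMatcher` from the standard library, and the 0.8 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher


def step_similarities(original_steps, perturbed_steps):
    """Character-level similarity in [0, 1] for each aligned pair of steps."""
    return [
        SequenceMatcher(None, o, p).ratio()
        for o, p in zip(original_steps, perturbed_steps)
    ]


def first_divergent_step(original_steps, perturbed_steps, threshold=0.8):
    """1-based index of the first step below the threshold, or None."""
    sims = step_similarities(original_steps, perturbed_steps)
    for index, sim in enumerate(sims, start=1):
        if sim < threshold:
            return index
    return None


# Hypothetical traces: the first action matches, the reasoning then drifts.
original_trace = ["search: population of Paris", "12 + 30 = 42", "answer: 42"]
perturbed_trace = ["search: population of Paris", "QQQQ", "answer: 17"]
print(first_divergent_step(original_trace, perturbed_trace))
```

A monitor that inspects only the first step would see identical actions in this toy case and miss the divergence that begins at step 2.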

The paper is unusually transparent about what doesn't work. Wild cluster bootstrap — a more conservative statistical test appropriate for small cluster counts — is non-significant at both K=10 (model clusters) and K=7 (family clusters) [§4.2]. This means the gap is real at the cell level but cannot yet be confidently attributed to a model-level effect rather than question-level variation — a meaningful caveat for anyone trying to generalize across model families. Cross-family generator swaps destroy per-cell rankings (Spearman ρ = +0.14), so the exact magnitude depends on which system generates the perturbations [§4.5]. A second LLM judge agrees with the primary judge at only Cohen's κ = 0.50 — moderate agreement, not strong [§4.6]. Two mechanism hypotheses proposed in earlier drafts failed to replicate on the held-out model and are explicitly retracted [§4.9].
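To make the judge-agreement figure concrete, the sketch below computes Cohen's κ from two lists of labels using the standard formula κ = (p_o − p_e) / (1 − p_e). The toy labels are hypothetical and are chosen so that 75% raw agreement yields κ = 0.50.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty, aligned label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the raters labeled independently at their own base rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        count_a[label] * count_b[label]
        for label in count_a.keys() | count_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts (1 = "answer changed", 0 = "answer unchanged").
judge_1 = [1, 1, 1, 1, 0, 0, 0, 0]
judge_2 = [1, 1, 1, 0, 1, 0, 0, 0]
print(cohens_kappa(judge_1, judge_2))  # 0.5
```

The example shows why κ = 0.50 counts as only moderate: two judges can agree on three quarters of items and still land there once chance agreement is subtracted.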

Why It Matters

This is a lab measurement, not a production tool. The benchmarks cover controlled academic tasks, not messy real-world agent deployments with mixed perturbation types and human edits. The wild cluster bootstrap failure means the finding's generalizability across arbitrary new model families remains unproven — reliable cross-family claims likely require studies with 30+ independent model clusters, which could take years to accumulate.

But the directional signal is consistent enough to inform engineering decisions today. If you're building input normalization for an agent pipeline, this work suggests that guarding against paraphrase-class variation — upstream model rewrites, user rephrasings — matters more than standardizing formatting [§1]. The stealth-divergence pattern, where the first agent step looks fine but intermediate reasoning silently corrupts, also suggests that monitoring only the first action of an agent trace is insufficient for catching perturbation-induced failures [§4.9].

Best AI Models Can Only Discover Half of Unfamiliar Physics Laws

Matt L. Wiemann, Lindsay M. Smith, Peter Melchior, Siddharth Mishra-Sharma, Andrew Gordon Wilson, Pavel Izmailov, Carolina Cuesta-Lázaro

When asked to discover physics laws they've never seen before — not recall textbook equations — the strongest AI models pass only about half the challenges.

The Problem

Frontier LLMs now score over 90% on graduate-level physics knowledge benchmarks like GPQA [§2]. But strong performance on such tests "reflects what a model has learned about known science, but not what it can discover" [§2]. The question is whether these models can reason their way to genuinely unfamiliar scientific conclusions — the kind of thinking that drives actual breakthroughs.

Existing benchmarks that try to test discovery ability have a structural limitation: they typically present canonical laws with shifted parameters and a known set of relevant variables, "leaving the conceptual part of scientific discovery largely unprobed" [§1]. NewtonBench shifts parameters of standard equations but preserves their functional family; PhysGym exposes named controllable variables; Gravity-Bench restricts to two-body dynamics where the relevant degrees of freedom are already evident [§2]. Real discovery often requires figuring out which variables matter in the first place.

This matters because organizations are increasingly deploying LLMs in scientific workflows — from hypothesis generation to experimental design. Understanding whether these models genuinely reason or merely retrieve patterns from training data determines how much trust to place in their outputs.

What They Did

The researchers built DISCOVERPHYSICS, a benchmark of 22 simulated worlds whose physics deliberately deviates from textbook laws [§1]. These worlds feature forces like exponentially screened gravity (where gravitational pull drops off faster than normal at distance), fractional-power laws, multi-species particle couplings, hidden "dark matter"-like particles that influence dynamics but aren't directly visible, non-coordinate-free physics, and time-varying interactions [Abstract].
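As an illustration of what "exponentially screened" means, the sketch below multiplies the Newtonian inverse-square force by an exponential factor. This particular functional form, the `screening_length` parameter, and the unit choice G = 1 are assumptions for illustration only; the benchmark's hidden equations are not given here and may differ.

```python
import math


def newtonian_force(m1: float, m2: float, r: float, G: float = 1.0) -> float:
    """Magnitude of the standard inverse-square gravitational attraction."""
    return G * m1 * m2 / r**2


def screened_force(m1: float, m2: float, r: float,
                   screening_length: float, G: float = 1.0) -> float:
    """Assumed screened form: the Newtonian force damped by exp(-r / screening_length)."""
    return newtonian_force(m1, m2, r, G) * math.exp(-r / screening_length)


# The ratio to the Newtonian force shrinks as the separation grows.
for r in (0.5, 1.0, 2.0, 4.0):
    ratio = screened_force(1.0, 1.0, r, screening_length=2.0) / newtonian_force(1.0, 1.0, r)
    print(f"r = {r}: screened / Newtonian = {ratio:.3f}")
```

An agent that assumes a pure power law would fit short-range data well and then overpredict attraction at large separations, which is the kind of mismatch a well-designed experiment could expose.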

The setup works like a virtual laboratory. An LLM agent receives raw trajectory data — positions and velocities of particles over time — from an N-body simulator [§3]. The agent then proposes experiments (placing test particles in specific configurations), observes the results, and iteratively refines its hypothesis about what force law governs the world [§3.1]. Think of it as giving the model a sandbox universe and asking it to play Newton: watch things move, design clever tests, and figure out the rules.

Critically, no data is stored — trajectories are generated on demand by the simulator from hidden equations, so the benchmark can't be gamed through memorization [§3]. After a fixed budget of experimentation rounds, the agent must submit both a Python implementation of the inferred law and a natural-language explanation of the world's physics [Figure 1].
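The idea of generating trajectories on demand can be sketched with a tiny simulator: the agent supplies initial conditions, and the simulator integrates a hidden acceleration function that the agent never sees. Everything below is hypothetical, including the force law, the single fixed source at the origin, and the velocity Verlet integrator. It is not the benchmark's simulator.

```python
import math


def _hidden_acceleration(position):
    """Hypothetical hidden law: screened attraction toward the origin."""
    x, y = position
    r = math.hypot(x, y)
    magnitude = math.exp(-r / 2.0) / r**2
    return (-magnitude * x / r, -magnitude * y / r)


def run_experiment(position, velocity, dt=0.01, steps=50):
    """Integrate one test particle with velocity Verlet; return (pos, vel) per step."""
    trajectory = [(position, velocity)]
    ax, ay = _hidden_acceleration(position)
    for _ in range(steps):
        # Half-step velocity update, full-step position update.
        vx_half = velocity[0] + 0.5 * dt * ax
        vy_half = velocity[1] + 0.5 * dt * ay
        position = (position[0] + dt * vx_half, position[1] + dt * vy_half)
        # Recompute acceleration at the new position, then finish the velocity step.
        ax, ay = _hidden_acceleration(position)
        velocity = (vx_half + 0.5 * dt * ax, vy_half + 0.5 * dt * ay)
        trajectory.append((position, velocity))
    return trajectory


# An "experiment": release a particle at rest one unit from the source.
result = run_experiment(position=(1.0, 0.0), velocity=(0.0, 0.0))
final_position, final_velocity = result[-1]
print(final_position, final_velocity)
```

Because each trajectory is computed from the initial conditions at request time, there is no fixed dataset to memorize; the same request simply reproduces the same dynamics.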

Evaluation uses two complementary axes [§1]. Trajectory MSE measures whether the agent's discovered law accurately predicts dynamics on held-out test particles. An explanation score, computed by an LLM judge following a human-written rubric, measures whether the agent correctly identified the conceptual features of the world — not just whether it fit the curve, but whether it understood why the curve has that shape.
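The trajectory axis can be pictured as a plain mean squared error between predicted and observed coordinates. The sketch below assumes equal weighting over time steps and coordinates, which may differ from the benchmark's exact normalization; the toy trajectories are hypothetical.

```python
def trajectory_mse(predicted, observed):
    """Mean squared error over all time steps and coordinates.

    Each trajectory is a sequence of per-step coordinate tuples of equal shape.
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("trajectories must be non-empty and the same length")
    total, count = 0.0, 0
    for p_step, o_step in zip(predicted, observed):
        if len(p_step) != len(o_step):
            raise ValueError("each step must have the same number of coordinates")
        for p, o in zip(p_step, o_step):
            total += (p - o) ** 2
            count += 1
    return total / count


# Hypothetical two-step, two-coordinate trajectories.
predicted = [(0.0, 0.0), (1.0, 1.0)]
observed = [(0.0, 0.0), (1.0, 3.0)]
print(trajectory_mse(predicted, observed))  # 1.0
```

A low score on this axis says only that the submitted law reproduces the motion; it says nothing about whether the accompanying explanation names the right mechanism, which is why the second axis exists.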

The Results

Across eleven frontier and open-source models, the strongest agents pass only about half of the 22 worlds [Abstract]. The failures cluster around worlds requiring discovery of latent structure — hidden particle species, extra dimensions, and similar features where the relevant variables aren't directly observable [Abstract].

A key finding: good predictive accuracy does not guarantee high explanation quality. "The model achieving the lowest trajectory MSE is not the one with the highest explanation score, and benefits less from additional experimental rounds, which we read as a tendency to fit data without revising its conceptual picture" [§1]. In other words, some models learn to curve-fit without understanding — a digital echo of the historical gap between empirical models and genuine theoretical insight.

Open-source models lag substantially behind commercial models on both dimensions: designing informative experiments and extracting conclusions from the resulting data [Abstract]. Their performance remains "largely unchanged between guided and randomized experimentation" [§1]. Whether experiments are chosen strategically or at random makes little difference, a sign that the iterative hypothesis-refinement loop isn't functioning for these models.

The benchmark also reveals that conceptual understanding depends on hypothesis refinement through well-chosen experiments [Abstract]. Models that design better experiments arrive at better explanations, reinforcing that scientific discovery isn't just about processing data — it's about knowing what data to collect.

Why It Matters

This is a lab proof-of-concept for measuring AI scientific reasoning, not a production evaluation tool. But the findings carry practical weight for anyone building AI-for-science pipelines.

The decoupling of prediction from explanation is the most consequential result. If your workflow relies on LLMs to generate scientific hypotheses — not just fit models — current systems have a concrete ceiling: they can match observed dynamics but frequently miss the generative mechanism behind them [§1]. This is especially true for problems involving hidden variables or latent structure, which describes a large fraction of interesting open problems in the physical sciences.

The open-source gap is also worth tracking. For organizations that need to run discovery workflows on private infrastructure, the finding that open-source models don't meaningfully benefit from iterative experimentation [§1] suggests that simply giving a model more rounds of interaction won't close the gap; the models likely need better reasoning capabilities. The benchmark, simulator, and evaluation framework are publicly released [§1], so teams can track whether future models close these specific gaps.
