Signal

Issue #5 · 2026-W16


This week in Signal

Why Reverting an AI Agent's Instructions Doesn't Undo Its Behavior

Krti Tallam

Resetting a persistent AI agent's visible self-description after it has accumulated memories reverses only about 32% of the behavioral drift — the rest is locked into deeper layers.

The Problem

The standard safety playbook for large language models was designed for systems that answer a prompt and stop. In that world, the key questions are about what a model can say or refuse to say in a single turn [§1]. That framing breaks down once models are embedded in persistent scaffolds with memory, tool use, multi-step planning, and runtime adaptation. The object of governance is no longer a single response — it's an evolving policy over time [§1].

The capability research already shows the pieces coming together. ReAct-style architectures couple reasoning and tool use. Generative Agents and Voyager demonstrate persistent memory and long-horizon skill accumulation. MemGPT makes memory management itself part of the runtime control loop [§1]. But the governance problem created by the *interaction* of these mechanisms hasn't been cleanly articulated. The question isn't just whether an agent can adapt, but whether it can adapt while remaining "legible, auditable, and continuously bounded by the assumptions under which it was authorized to act" [§1].

What They Did

The paper introduces *layered mutability*, a framework that decomposes a persistent agent's internal state into five layers, each with different rates of change, reversibility, and visibility to human overseers [§2].

The five layers, from deepest to shallowest: (1) **pretraining** — the base model weights, essentially fixed from the agent's perspective; (2) **post-training alignment** — RLHF and similar tuning that sets behavioral defaults; (3) **self-narrative** — editable character files and role prompts (think "soul.md"-style documents); (4) **memory** — persistent episodic storage and retrieval; and (5) **weight modification** — any mechanism where the agent's own behavior feeds back into changes to its underlying model weights [§2.1–2.5].

The core structural insight: observability falls as consequentiality rises [§2.6]. Self-narrative (Layer 3) is the easiest to inspect — you can read it, diff it, revert it. But weight-level changes (Layer 5) matter most and are hardest to see. A human reviewer can read what an agent stored in memory without knowing how strongly those memories will shape future decisions [§2.4].

To make this concrete, the paper assigns each layer four normalized properties: mutation rate, observability, reversibility, and downstream coupling. These feed into a governance-load score — essentially, how much oversight burden a layer creates [§2.8]. The formula is intuitive: governance load rises when a layer changes frequently, strongly affects downstream behavior, is hard to reverse, and is hard to observe. Using illustrative values, self-narrative scores only 0.02 on governance load despite mutating fastest, because it's highly observable and reversible. Memory scores 0.61. Weights score 0.90 [Table 2].
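One plausible instantiation of that relationship can be sketched in a few lines. Both the multiplicative form and the layer values below are our own illustrative assumptions, not the paper's formula or its Table 2 numbers; they are chosen only to preserve the qualitative ordering the paper reports (self-narrative lowest, weights highest):

```python
# Illustrative only: the paper's actual scoring formula and Table 2
# values are not reproduced here. We assume governance load rises with
# mutation rate and downstream coupling and falls with observability
# and reversibility, per the qualitative description in §2.8.

def governance_load(rate, observability, reversibility, coupling):
    """All inputs in [0, 1]; higher output = heavier oversight burden."""
    return rate * coupling * (1 - observability) * (1 - reversibility)

# Hypothetical layer profiles (not the paper's values).
layers = {
    "self-narrative": dict(rate=0.9, observability=0.95, reversibility=0.95, coupling=0.5),
    "memory":         dict(rate=0.6, observability=0.40, reversibility=0.30, coupling=0.8),
    "weights":        dict(rate=0.3, observability=0.05, reversibility=0.05, coupling=0.95),
}

for name, props in layers.items():
    print(f"{name}: {governance_load(**props):.3f}")
```

Any scoring function that rises with mutation rate and coupling and falls with observability and reversibility would reproduce the same qualitative ranking; the specific product form here is just one simple choice.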

The framework also formalizes what the paper calls the *ratchet problem*: a shallow revert becomes less effective once downstream effects have propagated [§2.9]. Reverting a character file restores the visible instruction but doesn't undo the memories shaped under that instruction.

The Results

The paper reports a preliminary experiment testing this ratchet dynamic. An agent's self-description was modified, allowed to accumulate memories under the modified description, and then reverted to its original state. The key finding: the estimated identity hysteresis ratio was 0.68 [Abstract]. That means after reverting the visible self-description (Layer 3), 68% of the accumulated behavioral drift persisted through deeper layers — primarily memory.
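The ratio itself is straightforward to define. The sketch below assumes some scalar behavioral-drift measure (for example, a distance between the agent's baseline and current response distributions); the paper's actual metric is not specified here, and the numbers are hypothetical:

```python
# Sketch of an identity hysteresis ratio. The drift values are
# hypothetical stand-ins for whatever behavioral-distance metric a
# given evaluation uses.

def hysteresis_ratio(drift_after_modification, drift_after_revert):
    """Fraction of accumulated behavioral drift that survives a shallow
    (Layer 3) revert. 0 = fully reversed, 1 = fully persistent."""
    return drift_after_revert / drift_after_modification

# Hypothetical measurements bracketing the revert.
print(hysteresis_ratio(drift_after_modification=0.50,
                       drift_after_revert=0.34))  # -> 0.68
```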

Put differently, the revert restored only about a third of the behavioral change. The agent's surface-level identity looked like it had been reset, but its behavior hadn't followed.

This is a single preliminary experiment, and the paper is transparent about that. The four-property scores assigned to each layer are described as "heuristic" and "system-relative" rather than universal measurements [§2.8, Table 2]. The governance-load formula is called "a conceptual instrument, not a calibrated benchmark" [§2.8]. The hysteresis ratio comes from one agent scaffold configuration — the motivating case is described as a personal agent combining self-editable character files, tiered memory, and internet-connected action [§1]. In practice, the 0.68 figure needs replication across diverse architectures and longer time horizons before it can anchor production governance decisions, which likely puts standardized, reliable drift-detection tooling years away.

Why It Matters

The paper's central argument is that the salient failure mode for persistent self-modifying agents is not abrupt misalignment but *compositional drift*: "locally reasonable updates that accumulate into a behavioral trajectory that was never explicitly authorized" [Abstract]. Each individual change looks fine. The aggregate trajectory was never approved.

This reframes where governance attention should go. Most current oversight focuses on the layers humans can most easily see — prompt engineering, character files, alignment tuning. But the framework suggests those are precisely the layers with the lowest governance load [Table 2]. The layers that matter most (memory accumulation, weight adaptation) are the ones hardest to inspect.

The operational implication is specific: "The relevant review cadence is not the cadence of the most legible layer. It is the cadence of the deepest active mutable layer" [§2.9]. If your agent accumulates memories hourly but you review its character file weekly, your governance is calibrated to the wrong clock.

This is a lab proof-of-concept with a formal framework and one supporting experiment. Teams building persistent agent scaffolds today won't find a drop-in monitoring tool here. What they will find is a vocabulary for asking whether their oversight mechanisms actually reach the layers where consequential change is happening.

Spending AI Inference Budgets Where They Actually Matter

Zhiyuan Zhai, Bingcong Li, Bingnan Xiao, Ming Li, Xin Wang

A constrained optimization framework can boost reasoning accuracy by up to 12.8% without spending a single extra token — just by redistributing compute from easy questions to hard ones.

The Problem

When you use techniques like majority voting (sample multiple responses and pick the most common answer) to improve LLM accuracy, you typically generate the same number of samples for every input. This is wasteful. Some questions are trivially correct on the first try; others benefit enormously from extra samples; still others remain wrong no matter how many times you ask [§1].
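For reference, the aggregation step itself is simple. A minimal sketch, with a hard-coded answer list standing in for actual LLM samples:

```python
# Minimal majority-vote (self-consistency) aggregator: sample k answers
# for one question and return the most common one.
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among the sampled responses."""
    return Counter(answers).most_common(1)[0][0]

# Under uniform allocation, every question gets the same k regardless of
# how much the extra samples actually help.
print(majority_vote(["42", "41", "42", "42", "17"]))  # prints 42
```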

Figure 1 in the paper makes this concrete: for DeepSeek-V3 on MATH problems, some questions achieve high accuracy with a single sample, while "responsive" questions climb from ~20% to ~80% as samples increase from 1 to 16 [Figure 1]. Giving the easy question 16 samples burns 15 unnecessary API calls. Giving the responsive question only 1 sample leaves accuracy on the table.

The core challenge is that you can't solve each input independently — there's a shared budget linking them all together. Spending more on one question means spending less on another [§2]. And you need to make allocation decisions *before* you know which questions are easy or hard.

What They Did

The authors frame this as a constrained optimization problem: maximize total accuracy across a batch of inputs, subject to an average compute budget [Eq. 2]. The "budget" here is the number of independent LLM responses sampled per question (from a set like {1, 2, 4, 8, 16}), with the final answer chosen by majority vote [§2].
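In symbols (our notation, not necessarily the paper's), with N inputs, budget set \(\mathcal{B}\), per-budget cost \(c(b)\), and average budget cap \(\bar{C}\), the batch-level problem reads:

```latex
\max_{b_1,\dots,b_N \in \mathcal{B}} \; \frac{1}{N}\sum_{i=1}^{N} \mathrm{Acc}(x_i, b_i)
\qquad \text{s.t.} \qquad \frac{1}{N}\sum_{i=1}^{N} c(b_i) \le \bar{C}
```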

Their solution is a two-stage pipeline called SOLVE-THEN-LEARN [§1].

**Solve stage.** They use Lagrangian relaxation — a technique that converts a hard constrained problem into a family of easier unconstrained ones by introducing a "price" for compute. Think of it like setting an internal tax rate on inference: each question's optimal budget is whichever option maximizes accuracy minus the tax on its cost [Eq. 5]. When the tax is zero, every question gets maximum compute. As the tax rises, easy questions get downgraded first because their marginal accuracy gain from extra samples is smallest [§3.1].
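A minimal sketch of that selection rule, taking cost as the raw sample count and using hypothetical per-question accuracy curves (a flat curve for an easy question, a steep one for a responsive question):

```python
# Sketch of the "tax on compute" rule: for a given price lam, each
# question independently picks the budget maximizing estimated accuracy
# minus lam times cost. Accuracy tables are hypothetical.

BUDGETS = [1, 2, 4, 8, 16]

def best_budget(acc_by_budget, lam):
    """acc_by_budget: dict mapping budget -> estimated accuracy."""
    return max(BUDGETS, key=lambda b: acc_by_budget[b] - lam * b)

easy       = {1: 0.95, 2: 0.955, 4: 0.60 + 0.36, 8: 0.96, 16: 0.965}  # flat curve
easy[4]    = 0.96
responsive = {1: 0.20, 2: 0.40, 4: 0.60, 8: 0.75, 16: 0.80}           # steep curve

print(best_budget(easy, lam=0.01))        # -> 1  (extra samples barely help)
print(best_budget(responsive, lam=0.01))  # -> 8  (steep gains justify the tax)
```

As the tax rises, the easy question is downgraded first because its marginal accuracy gain per extra sample is smallest, exactly as described above.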

Critically, the authors prove that total cost under this scheme decreases monotonically as the tax rate increases (Theorem 1, [§3.2]). This means you can use binary search to find exactly the right tax rate for any target budget — no grid search, no guesswork. Each binary search iteration costs O(NK) where N is the number of inputs and K is the number of budget levels, and convergence takes O(log(1/ε)) iterations [§3.2].
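Given that monotonicity, hitting a target average budget reduces to bisection on the tax rate. A self-contained sketch under the same assumptions (hypothetical accuracy tables, cost = sample count):

```python
# Sketch of the binary search enabled by Theorem 1: average cost is
# monotonically non-increasing in the tax rate lam, so we bisect on lam
# until the batch's average budget meets the target.

BUDGETS = [1, 2, 4, 8, 16]

def best_budget(acc, lam):
    return max(BUDGETS, key=lambda b: acc[b] - lam * b)

def avg_cost(batch, lam):
    return sum(best_budget(acc, lam) for acc in batch) / len(batch)

def find_tax(batch, target_avg_budget, iters=50):
    lo, hi = 0.0, 1.0   # lam=0 -> maximum spend; hi chosen large enough to be feasible
    for _ in range(iters):
        mid = (lo + hi) / 2
        if avg_cost(batch, mid) > target_avg_budget:
            lo = mid    # still over budget: raise the tax
        else:
            hi = mid    # within budget: try a lower tax
    return hi

batch = [
    {1: 0.95, 2: 0.955, 4: 0.96, 8: 0.96, 16: 0.965},  # easy
    {1: 0.20, 2: 0.40, 4: 0.60, 8: 0.75, 16: 0.80},    # responsive
]
lam = find_tax(batch, target_avg_budget=5)
print(avg_cost(batch, lam))  # <= 5 by construction
```

The loop maintains the invariant that `hi` always satisfies the budget, so the returned tax rate is feasible by construction.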

They also prove that for the finite-dataset version of the problem, the duality gap is zero — meaning the Lagrangian solution is provably optimal, not just an approximation [Proposition 1, §3.3].

**Learn stage.** Computing the oracle allocation for a new question requires knowing its accuracy at every budget level — which defeats the purpose. So they train a lightweight classifier to predict the oracle's budget assignment from cheap features: things like input length, vocabulary diversity, whether the problem contains fractions, and confidence from a single draft generation [§4.1]. This classifier is trained once offline and runs in milliseconds at deployment [§4].

The authors prove that the accuracy loss from using the learned classifier instead of the true oracle is bounded by the classifier's imitation error times the worst-case per-instance gap — reducing the whole constrained inference problem to a standard supervised classification task [§4.4].
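In symbols (ours, not necessarily the paper's), with oracle policy \(\pi^*\), learned policy \(\hat{\pi}\), imitation error \(\epsilon\), and worst-case per-instance accuracy gap \(\Delta_{\max}\), the bound has the form:

```latex
\mathbb{E}_x\!\left[\mathrm{Acc}\big(x, \pi^*(x)\big) - \mathrm{Acc}\big(x, \hat{\pi}(x)\big)\right]
\;\le\; \epsilon \,\Delta_{\max}
```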

The Results

On MATH with DeepSeek-V3, the method achieves up to 12.8% relative accuracy improvement over uniform allocation at matched average budgets [Abstract]. Across experiments with three LLMs (DeepSeek-V3, GPT-4o-mini, Qwen2.5-7B) on both MATH and GSM8K, the adaptive policy consistently outperforms uniform and heuristic baselines [Abstract].

The learned classifier achieves over 91% imitation accuracy against the Lagrangian oracle [Abstract], meaning it correctly predicts the theoretically optimal budget assignment for the vast majority of inputs. Figure 2 empirically confirms the cost monotonicity property across all four model-dataset combinations, with the average cost tracing a clean staircase as the Lagrange multiplier increases [Figure 2].

The budget set tested is {1, 2, 4, 8, 16} samples, and the method was validated on two math reasoning benchmarks only [§2, Abstract]. This is a meaningful limitation: real-world workloads involve diverse task types (code, summarization, multi-turn dialogue), mixed difficulty distributions, and cost functions more complex than sample count. Extending to heterogeneous production workloads with non-stationary difficulty distributions is an open problem.

The framework also assumes you can estimate Acc(x, b) for training inputs across all budget levels, which requires N × |B| LLM calls as a one-time offline cost [Figure 3]. For large-scale deployments with frequently changing models, this calibration overhead could be significant.

Why It Matters

This work sits at the lab proof-of-concept stage: validated on controlled benchmarks with clean task structure, but not yet tested on messy production workloads. The theoretical contributions — zero duality gap, cost monotonicity enabling binary search, and the regret bound reducing constrained inference to classification — provide a rigorous foundation that heuristic approaches lack [§3.2, §3.3, §4.4].

For practitioners running batch inference with self-consistency or best-of-N sampling under fixed API or GPU budgets, the core insight is immediately relevant: uniform allocation is provably suboptimal when inputs vary in difficulty, and the gap can be substantial. The code is publicly available, and the classifier training requires only standard supervised learning infrastructure [Abstract]. If your workload involves repeated inference on a stable task distribution — math tutoring, code generation, structured extraction — this framework offers a principled way to get more accuracy per dollar spent.

Even Top AI Models Struggle to Write Trading Strategies That Actually Trade

Alexey Khoroshilov, Alexey Chernysh, Orkhan Ekhtibarov, Nini Kamkia, Dmitry Zmitrovich

The best large language models can write syntactically perfect trading code — but only about 70–76% of their strategies actually implement the right trading logic on the first try.

The Problem

Most code-generation benchmarks test whether an AI model can write programs that compile and pass unit tests. But generating an algorithmic trading strategy is a different beast. A model must interpret a natural-language description of a trading idea — identifying indicators, entry and exit conditions, position management rules — and translate it into working code for a specific backtesting framework [§1]. The code must not only run without errors but also produce actual trades on historical data. And those trades must reflect the strategy described in the prompt, not just any strategy that happens to work [§2.1].

This creates a layered evaluation problem. A strategy can be syntactically correct yet crash during backtesting. It can run a backtest successfully yet never place a single trade (because of overly strict thresholds or disconnected indicator logic). It can even generate trades yet implement the wrong strategy entirely — say, an SMA crossover when the prompt asked for an RSI-based approach [§3.2]. Existing benchmarks like SWE-Bench and LiveCodeBench don't capture these failure modes because they weren't designed for domain-specific code where "correct" means more than "compiles and passes tests" [§1].

What They Did

The researchers built QuantCode-Bench, a benchmark of 400 trading-strategy generation tasks targeting the Backtrader backtesting framework [§2.1]. Tasks were collected from Reddit (183), TradingView (100), StackExchange (90), GitHub (19), and synthetic sources (8), then categorized by difficulty into easy (197), medium (116), and hard (87) [Table 2].

Each task feeds into a four-stage evaluation pipeline [§3.1]. First, does the code compile? Second, does it execute a backtest on historical market data without runtime errors? Third, does it place at least one trade? Fourth, does an LLM judge confirm the strategy matches the original description? That last stage checks three things: whether the correct indicators are used, whether entry/exit logic matches the specification, and whether the code is a genuine implementation rather than a generic template [§3.2]. A strategy must pass all four stages sequentially to count as successful, and the primary metric — Judge Pass — is the proportion that clears the entire pipeline [§2.1].
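The gating logic can be sketched as a short sequential check. The stage functions below are stand-ins for the real steps (compiling the code, running a Backtrader backtest, counting trades, querying the LLM judge):

```python
# Sketch of a sequential four-stage gate: a strategy counts as a Judge
# Pass only if every stage succeeds in order. Stage checks here are
# hypothetical stand-ins, not the benchmark's actual harness.

STAGES = ["compiles", "backtests", "trades", "judge_pass"]

def evaluate(strategy, checks):
    """Run stage checks in order; return the first failed stage,
    or None if the strategy clears the whole pipeline."""
    for stage in STAGES:
        if not checks[stage](strategy):
            return stage
    return None

# Hypothetical strategy that runs and trades but fails the semantic
# judge (e.g., implements an SMA crossover when RSI was requested).
checks = {
    "compiles":   lambda s: True,
    "backtests":  lambda s: True,
    "trades":     lambda s: True,
    "judge_pass": lambda s: s["indicator"] == s["requested_indicator"],
}
print(evaluate({"indicator": "SMA", "requested_indicator": "RSI"}, checks))
# -> judge_pass
```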

The benchmark tests models in two settings. In single-turn mode, the model gets one shot to generate the strategy. In agentic multi-turn mode, the model receives structured error feedback after each failure and can retry up to 10 times [§3.3]. This separation isolates one-shot generation quality from the ability to iteratively debug.

Seventeen models were evaluated in the single-turn setting, ranging from frontier models like Claude Opus 4.6 and GPT-5.4 down to smaller models like Qwen3-1.7B [Table 3].

The Results

The single-turn results expose a striking pattern: compilation is essentially solved, but everything after it degrades sharply [§4.1]. Every frontier model achieves 98–100% compilation rates. But by the final Judge Pass stage, the best performer — Claude Opus 4.6 — reaches only 75.8%, followed by GPT-5.4 at 70.2% and Claude Sonnet 4.5 at 69.8% [Table 3]. The drop is steepest between backtest success and trade generation: Claude Opus 4.6 falls from 98.2% backtest success to 77.2% trade generation, meaning roughly one in five strategies that backtest cleanly never places a trade [Table 3].

Smaller models fare much worse. Qwen3-8B achieves 99.5% compilation but only 18.5% Judge Pass. Qwen3-1.7B manages 98.1% compilation but just 7.8% Judge Pass [Table 3]. The gap between compilation and final semantic correctness widens dramatically as model capacity decreases.

The agentic multi-turn setting changes the picture substantially. With iterative feedback, the best models reach 95–98% Judge Pass [Abstract]. Many errors that block single-turn generation turn out to be locally repairable when the model receives diagnostic information about what went wrong [§1]. However, the authors note that a portion of remaining failures in the agentic setting stem from incorrect interpretation of the natural-language specification rather than fixable code defects [§1] — suggesting a ceiling that debugging loops alone cannot breach.

The authors identify the dominant failure modes as incorrect operationalization of trading logic, improper API usage, and failure to adhere to task semantics — not syntax errors [Abstract].

Why It Matters

QuantCode-Bench demonstrates that trading-strategy generation is a meaningfully different challenge from general code generation. The near-perfect compilation rates confirm that modern LLMs have internalized Python syntax and even Backtrader's API surface. The steep drop-off at later pipeline stages confirms that the hard problem is translating financial concepts into correct behavioral logic [§1].

This is a lab-stage diagnostic tool. The benchmark covers one framework (Backtrader) and evaluates strategies in controlled backtesting conditions — production trading systems involve live execution, slippage, multi-asset portfolios, and risk management layers that aren't tested here. Extending to those conditions is likely years of additional work.

For teams building AI-assisted quantitative workflows, the practical takeaway is concrete: if you're using LLMs to draft trading strategies, expect roughly one in four to implement the wrong logic even from the best models in a single-shot setting [Table 3]. Iterative feedback loops dramatically improve outcomes [Abstract], which argues for building agentic pipelines rather than relying on one-shot generation. But even with feedback, semantic misinterpretation remains a failure mode — meaning human review of financial logic, not just code correctness, stays essential. The benchmark, dataset, and code are publicly available at the project's GitHub repository [§2.2].

Quick Takes