Signal

Issue #12 · 2026-W23


This week in Signal

Tool-Augmented AI Agents Often Learn the Protocol, Not the Capability

Garvin Guo, Donglei Yu, Yu Chen, Xiang Wang, Shuai Li, Xinpei Zhao, Huaxing Liu, Qinghao Wang, Minpeng Liao

Multimodal AI agents that call tools during reasoning solve few problems that tool-free versions can't: 93–96% of their "tool-solved" answers are also reached by at least one tool-free setting.

The Problem

A growing class of multimodal AI agents can write and execute code mid-reasoning — cropping images, running OCR, performing calculations — and their benchmark scores are often attributed to these tool-mediated operations [§1]. But observing a tool call in a reasoning trace only proves the model *performed* one. It doesn't prove the tool supplied information the model lacked.

Consider the paper's opening example: an agent is asked what material a glove in a street-food image is made of. It writes a Python snippet, calls a code interpreter, and answers correctly. But the snippet just reloads and displays the original image — no cropping, zooming, or processing. A pure-text model with no tools answers correctly on the first try [§1]. The tool call is present in the trace, but it contributed nothing.

The core question is whether this pattern is isolated or systemic: across many problems, does tool access actually expand the set of things an agent can solve? [§1]

What They Did

The researchers studied two leading open-source tool-augmented multimodal agents — Thyme and DeepEyesV2 — both of which generate executable code during reasoning to manipulate images and perform calculations [§2.1]. They tested these agents across 15 benchmarks covering real-world understanding, OCR, chart comprehension, and mathematical reasoning [§3.1].

Each agent was evaluated in three settings. First, the **Tool-Enabled Agent**: the model running exactly as released, with full tool access. Second, the **Tool-Free Agent**: the same trained model, but with a prompt that forbids code generation and forces text-only reasoning at inference time. Third, a **Pure-Text Reasoner**: a separate model trained on data from the same source pool, but never exposed to a tool-calling trajectory during training [§3.2]. The Tool-Free Agent tests whether suppressing tools at inference time hurts performance; the Pure-Text Reasoner tests whether tool-calling trajectories are needed in the training data at all.

Beyond aggregate scores, the researchers performed solved-set analysis — tracking *which specific problems* each setting solved — to measure overlap. If tools genuinely expand capability, the tool-enabled agent should solve a meaningful set of problems that neither non-tool setting can crack [§1].
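The solved-set comparison can be sketched with plain set operations. This is a minimal illustration, not the paper's evaluation code: the function name `solved_set_overlap` and the toy problem IDs are assumptions made for the example.

```python
def solved_set_overlap(tool_enabled, tool_free, pure_text):
    """Summarize how much of the tool-enabled solved set is reachable without tools.

    Each argument is a set of problem IDs that the setting answered correctly.
    """
    if not tool_enabled:
        raise ValueError("tool-enabled solved set is empty")
    non_tool = tool_free | pure_text      # solved by at least one non-tool setting
    shared = tool_enabled & non_tool      # also reachable without tools
    tool_only = tool_enabled - non_tool   # solved only with tool access
    return {
        "shared_fraction": len(shared) / len(tool_enabled),
        "tool_only_fraction": len(tool_only) / len(tool_enabled),
        "tool_only_ids": sorted(tool_only),
    }


# Toy problem IDs, invented for illustration only.
tool_enabled = {"q1", "q2", "q3", "q4", "q5"}
tool_free = {"q1", "q2", "q6"}
pure_text = {"q2", "q3", "q4"}

summary = solved_set_overlap(tool_enabled, tool_free, pure_text)
print(summary)
```

In this toy case four of the five tool-enabled solutions are shared, so the tool-only fraction is 0.2. A large tool-only fraction is what genuine capability expansion would look like.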

They also ran two trajectory ablations that require no retraining. In **Tool Format Only**, they kept the model's generated code but replaced the execution result with an empty placeholder. In **Tool Result Only**, they removed the code but kept the execution output [§3.2]. This separates whether the model depends on the act of writing a tool call (the format) or on what the tool actually returns (the result).
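The two ablations amount to simple edits on a recorded trajectory. The sketch below assumes a hypothetical representation: a list of `Step` objects labeled `"reasoning"`, `"code"`, or `"result"`. The `Step` class, the labels, the placeholder string, and the toy trajectory are all invented for illustration and are not the paper's data format.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Step:
    kind: str      # hypothetical labels: "reasoning", "code", or "result"
    content: str


EMPTY_RESULT = "[no output]"  # assumed placeholder text


def tool_format_only(trajectory):
    """Keep the generated code, blank out every execution result."""
    return [
        replace(step, content=EMPTY_RESULT) if step.kind == "result" else step
        for step in trajectory
    ]


def tool_result_only(trajectory):
    """Drop the generated code, keep every execution result."""
    return [step for step in trajectory if step.kind != "code"]


# Toy trajectory, invented for illustration only.
trajectory = [
    Step("reasoning", "I should zoom in on the chart legend."),
    Step("code", "image.crop((10, 10, 200, 200))"),
    Step("result", "<cropped image>"),
    Step("reasoning", "The legend shows three series."),
]

format_only = tool_format_only(trajectory)
result_only = tool_result_only(trajectory)
```

Feeding each edited trajectory back to the model and comparing accuracy against the unedited run indicates which half of the tool call the answer depends on.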

The Results

Tool access yields little consistent aggregate improvement. On math reasoning benchmarks, DeepEyesV2's Tool-Free variant actually scores higher on 5 of 6 tasks — for example, 73.8 vs. 72.5 on MathVista and 43.1 vs. 37.8 on WeMath [Table 2]. The Pure-Text Reasoner, which never saw tool-calling trajectories during training, matches or exceeds the tool-enabled agents on most benchmarks: 74.6 on MathVista, 46.3 on WeMath, 77.5 on CharXiv descriptive [Tables 1, 2].

The solved-set analysis is the sharpest finding. Of all problems solved by the tool-enabled DeepEyesV2, 93.2% are also solved by at least one non-tool setting. For Thyme, that figure is 96.0% [Abstract, Figure 2]. The tool-only solved region — problems cracked exclusively because of tool access — is roughly 6.8% for DeepEyesV2 and 4.0% for Thyme across all task families [Figure 2].

The trajectory ablations reveal an inconsistent picture. DeepEyesV2's performance is better preserved by the tool-call format alone (keeping the code, discarding the result), while Thyme is better preserved by the returned execution result alone [§1]. Neither pattern produces stable solved-set expansion, suggesting the agents haven't learned *when* tool outputs change the answer — they've learned the protocol of calling tools [Abstract].

There are important scope limitations. The study examines two specific agents (Thyme and DeepEyesV2) on standard academic benchmarks [§3.1]. The Tool-Free Agent setting is explicitly described as "out-of-distribution" since the model was trained with tools but has them suppressed at inference, though the paper notes this should still produce a visible drop if tools were crucial to performance [§3.2]. Production workloads involving longer reasoning chains, more specialized tools, or harder visual processing tasks could show different patterns — this study establishes the finding for current benchmark-level evaluation, not for all possible tool-use scenarios.

Why It Matters

This work sits at the lab proof-of-concept stage for a new evaluation methodology, not a new capability. Its practical implication is for how teams assess and select agentic AI systems.

If you're evaluating tool-augmented agents, the presence of tool calls in reasoning traces is insufficient evidence that tools contributed to correct answers [§1]. The paper demonstrates that solved-set analysis — checking whether tool-free variants solve the same problems — is a more informative diagnostic [Abstract]. For teams building agentic systems, the finding that agents learn tool-calling patterns more reliably than tool-contributed capabilities [Abstract] suggests that training pipelines may need explicit signals for *when* tool use is necessary, not just *how* to call tools. The format-only and result-only ablations offer a lightweight, no-retraining method to audit whether your agent's tool use is substantive or procedural [§3.2].

Small Finetuned Models Beat Large Ones for African Speech Recognition

Victor Tolulope Olufemi, Oreoluwa Babatunde, Ramsey Njema, Bolarinwa Gbotemi, Wanchi Lucia Yen, John Uzodinma, Sunday Ajayi, Oluwademilade Williams, Kausar Moshood, Innocent Elendu Anyaele, Akebert Arefaine, Candace Hunzwi, Wongel Dawit Daniel, Emmilly Namuganga, Cleophas Kadima, Athanase Bahizire, Onitsiky Ranaivoson, Emmanuel Aaron, Nicholaus Ladislaus, Idris Muhammed, Jonathan Enoch Simenya, Martin Koome, Matewos Tegete Endaylalu, Peter Ifeoluwa Adeyemo, Hondi Prisca Birindwa, Ukachi Agnes Eze-Mbey, Yacoba Oduro-Yeboah, Pericles Adjovi, Mikel K. Ngueajio, Toluwani Aremu, Prasenjit Mitra

Compact speech recognition models finetuned on African language data cut word error rates by 27 percentage points compared to zero-shot models 3–40× their size.

The Problem

Modern speech recognition systems like Whisper, MMS, and Omnilingual ASR claim support for hundreds or even thousands of languages. But for most African languages — particularly those with limited digital resources — these promises remain "largely unrealized," with recognition quality, model accessibility, and deployment feasibility lagging far behind, as documented across community-driven benchmarks [§1].

Two gaps drive this. First, a performance gap: large multilingual models struggle with spontaneous African speech characterized by code-switching, tonal distinctions, and dialectal diversity. They frequently exhibit "hallucination loops, over-generation, and unstable transcription behavior" [§1]. Second, an efficiency gap: the models capable of handling this diversity — like Whisper Large-v3 at 1.5 billion parameters — are too computationally expensive for the edge devices and low-cost mobile hardware where African language tools are actually needed [§1].

The question is whether small, finetuned models can close both gaps simultaneously: matching or beating large models on accuracy while being deployable on constrained hardware.

What They Did

The researchers used the WAXAL corpus, a recently released dataset of spontaneous, image-prompted speech in 19 African languages recorded in participants' natural environments — not the scripted read-aloud speech that dominates most benchmarks [§3.1]. The labeled subset comprises 2,279.1 hours and 446,169 utterances [§3.1].

They evaluated three large zero-shot foundation models — Whisper Large-v3 (1.5B parameters), MMS-1B, and Omnilingual-1B — against three compact models finetuned on WAXAL's training split: Whisper Tiny (39M parameters), Whisper Small (244M), and MMS-300M (300M) [§4]. The Whisper models received full finetuning (all weights updated), while MMS-300M used a parameter-efficient approach with frozen encoder layers. Notably, MMS-300M couldn't output text at all before finetuning — its text prediction capabilities were "learned entirely from WAXAL fine-tuning" [§4].

Before evaluation, the team applied data cleaning heuristics that proved critical. They discarded audio segments shorter than 1.5 seconds and samples where ground-truth text implied a physically impossible speech rate above 4 words per second [§3.2]. This filtering exposed reference artifacts that were inflating error rates — for example, Lingala Whisper Tiny WER dropped from 113.5% to 49.0% after cleaning [§3.2].
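The two filtering rules are easy to express in code. The sketch below uses the thresholds quoted above. The function name `keep_sample`, the whitespace-based word count, and the sample records are assumptions for illustration; the paper's exact tokenization is not described here.

```python
MIN_DURATION_S = 1.5        # drop segments shorter than 1.5 seconds
MAX_WORDS_PER_SECOND = 4.0  # drop implied speech rates above 4 words per second


def keep_sample(duration_s, transcript):
    """Return True if a (duration, reference transcript) pair passes both filters."""
    if duration_s < MIN_DURATION_S:
        return False
    # Assumption: words are counted by splitting on whitespace.
    words_per_second = len(transcript.split()) / duration_s
    return words_per_second <= MAX_WORDS_PER_SECOND


# Toy records, invented for illustration only.
samples = [
    {"id": "a", "duration_s": 1.0, "text": "hello there"},
    {"id": "b", "duration_s": 2.0, "text": "one two three four five six seven eight nine"},
    {"id": "c", "duration_s": 5.0, "text": "a normal sentence with seven words here"},
]

kept = [s["id"] for s in samples if keep_sample(s["duration_s"], s["text"])]
print(kept)
```

Sample `a` fails the duration check and sample `b` implies 4.5 words per second, so only `c` survives.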

Beyond metrics, the team ran a distributed native-speaker audit. For each language-model combination, 1–4 native speakers reviewed 40 audio samples — 20 from the best-performing quartile and 20 from the worst — categorizing errors like phonetic substitutions, word boundary mistakes, hallucination loops, and code-switching mismatches [§3.3].
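One way to draw such an audit sample is sketched below. It assumes the quartiles are defined by per-utterance WER, which the summary above does not state explicitly. The function name `audit_sample`, the fixed seed, and the synthetic scores are likewise illustrative assumptions.

```python
import random


def audit_sample(utterance_wer, per_quartile=20, seed=0):
    """Pick utterances from the best and worst quartiles by per-utterance WER.

    utterance_wer maps an utterance ID to its WER (lower is better).
    """
    ranked = sorted(utterance_wer, key=utterance_wer.get)  # lowest WER first
    quartile_size = len(ranked) // 4
    if quartile_size < per_quartile:
        raise ValueError("each quartile needs at least per_quartile utterances")
    rng = random.Random(seed)  # fixed seed so the audit set is reproducible
    best = rng.sample(ranked[:quartile_size], per_quartile)
    worst = rng.sample(ranked[-quartile_size:], per_quartile)
    return best, worst


# Synthetic scores, invented for illustration only.
scores = {f"utt{i:03d}": i / 100 for i in range(200)}
best, worst = audit_sample(scores)
```

Sampling both tails rather than a uniform draw gives reviewers examples of what the model gets right as well as where it breaks down.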

The Results

Finetuned edge models achieved a macro-averaged WER of 38.0% compared to 64.9% for the best zero-shot baseline — a 26.9 percentage-point reduction using models 3–40× smaller [Abstract]. Domain specialization, not model scale, was the dominant factor for spontaneous African speech.
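A macro average weights every language equally, regardless of how many test utterances it has. The sketch below shows the calculation on made-up per-language numbers, then reworks the headline figures quoted above. The relative reduction is derived here from those two figures and is not a number reported in the summary.

```python
from statistics import mean


def macro_wer(per_language_wer):
    """Unweighted mean of per-language WERs: each language counts once."""
    return mean(per_language_wer.values())


# Hypothetical per-language WERs (percent), invented for illustration only.
finetuned = {"lang_a": 30.0, "lang_b": 40.0, "lang_c": 50.0}
zero_shot = {"lang_a": 60.0, "lang_b": 70.0, "lang_c": 80.0}
toy_gap = macro_wer(zero_shot) - macro_wer(finetuned)  # in percentage points

# Headline figures quoted in the text above.
reported_gap = round(64.9 - 38.0, 1)                     # absolute, percentage points
relative_reduction = round((64.9 - 38.0) / 64.9 * 100, 1)  # relative, percent

print(toy_gap, reported_gap, relative_reduction)
```

The distinction matters when comparing papers: a 26.9-point absolute drop from 64.9% corresponds to roughly a 41% relative reduction in errors.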

But this advantage has boundaries. Cross-domain evaluation on FLEURS (a read-speech benchmark) showed that "zero-shot models regain an advantage when the test domain matches their pretraining distribution" [Abstract]. Finetuned models could "recover usable performance on out-of-distribution speech," but the domain match between training and test data was the primary driver of which approach won [§1].

Whisper Large-v3 natively supports only 4 of the 19 languages, so it was excluded from the main comparison chart [Figure 2]. This itself illustrates the coverage problem: the largest model simply doesn't serve most of these languages.

The native-speaker audit revealed that CTC-based architectures (like MMS) and autoregressive architectures (like Whisper) fail differently across language families [Abstract]. The team also found that WER alone "misrepresents performance for syllabary-script languages where CER/WER ratios reveal substantially higher character-level accuracy than headline WER suggests" [Abstract] — meaning models are closer to correct than raw WER numbers indicate for some writing systems.
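The gap between WER and CER is easy to see in a small example. The sketch below computes both with a standard Levenshtein distance. It assumes whitespace tokenization, counts spaces as characters, and applies no text normalization, none of which is confirmed as the paper's setup. The English strings are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "the quick brown fox"
hypothesis = "the quick brwn fox"
word_error = wer(reference, hypothesis)   # one wrong word out of four
char_error = cer(reference, hypothesis)   # one missing character out of nineteen

looping = wer("a b", "a b c c c")         # extra words push WER above 100%
```

A single dropped character costs a full word under WER (0.25 here) but only 1/19 under CER, which is why the two metrics can tell different stories. The last line shows how over-generation can produce WER values above 100%.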

Several limitations constrain these findings. The native-speaker evaluation used only 40 utterances per language-model pair, was conducted as collaborative joint review rather than independent annotation, and annotators were native speakers but not trained linguists — so inter-annotator agreement isn't reported [§3.3]. The authors present qualitative findings as "exploratory linguistic observations rather than statistically confirmed patterns" [§3.3]. For practitioners, this means the error taxonomy is directionally useful but not yet rigorous enough to drive architecture selection decisions.

All finetuning used a single corpus (WAXAL), and the cross-domain results on FLEURS suggest the finetuned models don't fully generalize. Production systems handling mixed domains, some spontaneous and some scripted, would need more robust evaluation and likely additional finetuning data from diverse sources.

Why It Matters

This work sits at the proof-of-concept stage, but a practically useful one: all 57 finetuned model weights (3 architectures × 19 languages), training scripts, and evaluation code are publicly released [§1]. Organizations building speech applications for African language markets can start from these checkpoints rather than from scratch.

The core finding, that a 39M-parameter model can outperform a 1.5B-parameter one when finetuned on domain-matched data, has direct implications for deployment cost. If your target use case involves spontaneous conversational speech in any of these 19 languages, the finetuned compact models were both cheaper to run and more accurate than the large zero-shot multilingual models in this evaluation [Abstract]. The caveat: if your domain shifts toward read speech or formal contexts, the large zero-shot models may still win [Abstract].
