Tool-Augmented AI Agents Often Learn the Protocol, Not the Capability
Multimodal AI agents that call tools during reasoning solve few problems that non-tool settings can't: 93–96% of their "tool-solved" problems are also solved without tools.
- What they did — Researchers compared two leading open-source tool-augmented multimodal agents (Thyme and DeepEyesV2) against tool-free versions of themselves and a separately trained pure-text reasoner across 15 benchmarks spanning real-world understanding, OCR, chart reasoning, and math.
- Key result — 93.2% of DeepEyesV2's tool-solved problems and 96.0% of Thyme's were also solved by at least one non-tool setting, leaving a tool-only solved set of roughly 4–7%.
- Why it matters — Evaluation frameworks that credit tool use based on tool-call traces may systematically overstate the capability gains tools actually provide.
The Problem
A growing class of multimodal AI agents can write and execute code mid-reasoning — cropping images, running OCR, performing calculations — and their benchmark scores are often attributed to these tool-mediated operations [§1]. But observing a tool call in a reasoning trace only proves the model *performed* one. It doesn't prove the tool supplied information the model lacked.
Consider the paper's opening example: an agent is asked what material a glove in a street-food image is made of. It writes a Python snippet, calls a code interpreter, and answers correctly. But the snippet just reloads and displays the original image — no cropping, zooming, or processing. A pure-text model with no tools answers correctly on the first try [§1]. The tool call is present in the trace, but it contributed nothing.
The core question is whether this pattern is isolated or systemic: across many problems, does tool access actually expand the set of things an agent can solve? [§1]
What They Did
The researchers studied two leading open-source tool-augmented multimodal agents — Thyme and DeepEyesV2 — both of which generate executable code during reasoning to manipulate images and perform calculations [§2.1]. They tested these agents across 15 benchmarks covering real-world understanding, OCR, chart comprehension, and mathematical reasoning [§3.1].
Each agent was evaluated in three settings. First, the **Tool-Enabled Agent**: the model running exactly as released, with full tool access. Second, the **Tool-Free Agent**: the same trained model, but with a prompt that forbids code generation and forces text-only reasoning at inference time. Third, a **Pure-Text Reasoner**: a separate model trained from the same data source pool but without ever seeing a tool-calling trajectory during training [§3.2]. The Tool-Free Agent tests whether suppressing tools at inference time hurts performance; the Pure-Text Reasoner tests whether you even need tool-calling in the training data.
Beyond aggregate scores, the researchers performed solved-set analysis — tracking *which specific problems* each setting solved — to measure overlap. If tools genuinely expand capability, the tool-enabled agent should solve a meaningful set of problems that neither non-tool setting can crack [§1].
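The overlap measurement described above can be sketched with plain set operations. This is a minimal illustration, not the paper's code: the function name `solved_set_overlap`, the result keys, and the toy problem IDs are all assumptions made for the example.

```python
from typing import Dict, Set


def solved_set_overlap(
    tool_enabled: Set[str],
    tool_free: Set[str],
    pure_text: Set[str],
) -> Dict[str, object]:
    """Compare the tool-enabled solved set against the non-tool solved sets.

    Each argument is the set of problem IDs answered correctly in that setting.
    """
    non_tool = tool_free | pure_text      # solved by at least one non-tool setting
    recovered = tool_enabled & non_tool   # tool-solved and also solved without tools
    tool_only = tool_enabled - non_tool   # solved only in the tool-enabled setting
    n = len(tool_enabled)
    return {
        "recovered_frac": len(recovered) / n if n else 0.0,
        "tool_only_frac": len(tool_only) / n if n else 0.0,
        "tool_only_ids": tool_only,
    }


# Toy data, made up for illustration (not from the paper).
report = solved_set_overlap(
    tool_enabled={"q1", "q2", "q3", "q4"},
    tool_free={"q1", "q2"},
    pure_text={"q2", "q3", "q9"},
)
print(report["recovered_frac"], report["tool_only_frac"])  # 0.75 0.25
```

Working at the level of problem IDs rather than accuracy scores is the point: two settings with identical accuracy can still solve different problems, and only the set difference shows whether tools reach anything new.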
They also ran two trajectory ablations that require no retraining. In **Tool Format Only**, they kept the model's generated code but replaced the execution result with an empty placeholder. In **Tool Result Only**, they removed the code but kept the execution output [§3.2]. This separates whether the model depends on the act of writing a tool call (the format) or on what the tool actually returns (the result).
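The two ablations can be pictured as simple edits to a recorded trajectory. The sketch below assumes a hypothetical representation, a list of `Step` records tagged `"text"`, `"code"`, or `"result"`; the paper's actual trajectory format and placeholder string are not given here, so treat every name as illustrative.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Step:
    kind: str      # hypothetical tags: "text", "code", or "result"
    content: str


EMPTY_PLACEHOLDER = ""  # assumed stand-in for the "empty placeholder"


def tool_format_only(trajectory: List[Step]) -> List[Step]:
    """Keep the generated code, blank out every execution result."""
    return [
        replace(step, content=EMPTY_PLACEHOLDER) if step.kind == "result" else step
        for step in trajectory
    ]


def tool_result_only(trajectory: List[Step]) -> List[Step]:
    """Drop the generated code, keep every execution result."""
    return [step for step in trajectory if step.kind != "code"]


# Toy trajectory, made up for illustration.
trajectory = [
    Step("text", "I should zoom in on the glove."),
    Step("code", "img.crop((10, 10, 200, 200))"),
    Step("result", "<cropped image>"),
]
format_only = tool_format_only(trajectory)
result_only = tool_result_only(trajectory)
```

In an audit, each ablated trajectory would be fed back to the model to continue to a final answer, and accuracy compared against the unmodified run.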
The Results
Tool access yields little consistent aggregate improvement. On math reasoning benchmarks, DeepEyesV2's Tool-Free variant actually scores higher on 5 of 6 tasks — for example, 73.8 vs. 72.5 on MathVista and 43.1 vs. 37.8 on WeMath [Table 2]. The Pure-Text Reasoner, which never saw tool-calling trajectories during training, matches or exceeds the tool-enabled agents on most benchmarks: 74.6 on MathVista, 46.3 on WeMath, 77.5 on CharXiv descriptive [Tables 1, 2].
The solved-set analysis is the sharpest finding. Of all problems solved by the tool-enabled DeepEyesV2, 93.2% are also solved by at least one non-tool setting. For Thyme, that figure is 96.0% [Abstract, Figure 2]. The tool-only solved region (problems solved only in the tool-enabled setting) is the remainder: roughly 6.8% for DeepEyesV2 and 4.0% for Thyme across all task families [Figure 2].
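The relationship between the two percentages is simple complement arithmetic. The notation below is assumed for this summary rather than taken from the paper: S_tool, S_free, and S_text denote the sets of problems solved by the Tool-Enabled Agent, the Tool-Free Agent, and the Pure-Text Reasoner.

```latex
% Notation assumed for illustration; not the paper's own symbols.
\[
\text{recovered} =
  \frac{\lvert S_{\text{tool}} \cap (S_{\text{free}} \cup S_{\text{text}}) \rvert}
       {\lvert S_{\text{tool}} \rvert},
\qquad
\text{tool-only} = 1 - \text{recovered}
\]
\[
\text{DeepEyesV2: } 1 - 0.932 = 0.068,
\qquad
\text{Thyme: } 1 - 0.960 = 0.040
\]
```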
The trajectory ablations reveal an inconsistent picture. DeepEyesV2's performance is better preserved by the tool-call format alone (keeping the code, discarding the result), while Thyme is better preserved by the returned execution result alone [§1]. Neither pattern produces stable solved-set expansion, suggesting the agents haven't learned *when* tool outputs change the answer — they've learned the protocol of calling tools [Abstract].
There are important scope limitations. The study examines two specific agents (Thyme and DeepEyesV2) on standard academic benchmarks [§3.1]. The Tool-Free Agent setting is explicitly described as "out-of-distribution" since the model was trained with tools but has them suppressed at inference, though the paper notes this should still produce a visible drop if tools were crucial to performance [§3.2]. Production workloads involving longer reasoning chains, more specialized tools, or harder visual processing tasks could show different patterns — this study establishes the finding for current benchmark-level evaluation, not for all possible tool-use scenarios.
Why It Matters
This work contributes an evaluation methodology, at roughly the lab proof-of-concept stage, rather than a new capability. Its practical implication is for how teams assess and select agentic AI systems.
If you're evaluating tool-augmented agents, the presence of tool calls in reasoning traces is insufficient evidence that tools contributed to correct answers [§1]. The paper demonstrates that solved-set analysis — checking whether tool-free variants solve the same problems — is a more informative diagnostic [Abstract]. For teams building agentic systems, the finding that agents learn tool-calling patterns more reliably than tool-contributed capabilities [Abstract] suggests that training pipelines may need explicit signals for *when* tool use is necessary, not just *how* to call tools. The format-only and result-only ablations offer a lightweight, no-retraining method to audit whether your agent's tool use is substantive or procedural [§3.2].