New RL Method Fixes Tool-Use Collapse in Multimodal Agents

Agentic vision-language models need two distinct behaviors: thinking internally and calling external tools. The problem is that these two behaviors are not symmetric. Thinking is the self-contained default. Tool use is a high-variance auxiliary action. That asymmetry, which the authors of AXPO call the Thinking-Acting Gap, can quietly break standard RL training before the damage shows up in eval scores.

Under a standard recipe like GRPO, the gap shows up as two concrete symptoms. First, tool use is attempted on only around 30% of rollouts. Second, when a group of rollouts does attempt a tool call, all of those rollouts are wrong on roughly 40% of questions. When every rollout in a group is wrong, there is no relative signal to learn from: group-based methods score each rollout against the others in its group, so identical rewards produce identical, zero advantages. The gradient at the tool call is flat, and the model never gets corrected on the exact decisions that need fixing.
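To see why an all-wrong group carries no signal, here is a minimal sketch of a group-relative advantage in the style of GRPO. It assumes binary rewards and the common mean/standard-deviation normalization; the function name and the `eps` constant are illustrative, not taken from the AXPO paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own group (GRPO-style, assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is identical
    return [(r - mu) / (sigma + eps) for r in rewards]

# A mixed group: correct rollouts get positive advantage, wrong ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# An all-wrong group: every advantage is exactly zero, so nothing is learned.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In the all-wrong case each term is `(0 - 0) / eps`, so the policy gradient contribution from that group vanishes regardless of how bad the tool calls were.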

AXPO (Agent eXplorative Policy Optimization) targets this directly. For each all-wrong tool-using subgroup, it freezes the thinking prefix and resamples only the tool call and its continuation. It pairs this with uncertainty-based prefix selection to decide which prefixes are worth resampling. This surgical resampling regenerates the learning signal at the point where it was missing.
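The resampling step can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the `Rollout` fields, `resample_if_all_wrong`, and the caller-supplied `select_prefixes` and `sample_continuation` callables are all hypothetical names, and the actual uncertainty criterion AXPO uses is not specified here.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rollout:
    prefix: str        # thinking text before the tool call (kept frozen)
    continuation: str  # the tool call and everything generated after it
    uses_tool: bool
    correct: bool

def resample_if_all_wrong(
    group: List[Rollout],
    select_prefixes: Callable[[List[Rollout]], List[Rollout]],
    sample_continuation: Callable[[str], Rollout],
    n_resamples: int = 4,
) -> List[Rollout]:
    """Return extra rollouts for a group whose tool-using subgroup is all wrong."""
    tool_rollouts = [r for r in group if r.uses_tool]
    # Only intervene when tool use was attempted and every attempt failed.
    if not tool_rollouts or any(r.correct for r in tool_rollouts):
        return []
    extra = []
    for r in select_prefixes(tool_rollouts):
        for _ in range(n_resamples):
            # The prefix is reused verbatim; only the continuation is regenerated.
            extra.append(sample_continuation(r.prefix))
    return extra

# Toy stand-ins (assumptions): a real system would call the policy model here
# and rank prefixes by some uncertainty score.
rng = random.Random(0)

def toy_sampler(prefix: str) -> Rollout:
    return Rollout(prefix, "resampled tool call", True, rng.random() < 0.5)

def toy_selector(rollouts: List[Rollout]) -> List[Rollout]:
    return rollouts[:1]

group = [
    Rollout("think A", "bad call", True, False),
    Rollout("think B", "bad call", True, False),
    Rollout("think C", "", False, True),
]
extra = resample_if_all_wrong(group, toy_selector, toy_sampler)
```

Note that the example group contains a correct rollout that never called a tool; the check looks only at the tool-using subgroup, which is the case the article describes.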

The results are measured across nine multimodal benchmarks at three scales of Qwen3-VL-Thinking. SFT combined with AXPO outperforms SFT combined with GRPO by an average of 1.8 percentage points on Pass@1 and 1.8 percentage points on Pass@4 at the 8B scale. The more striking result: the 8B model trained with SFT plus AXPO surpasses the 32B base model on Pass@4 with a quarter of the parameters.

For builders, the practical implication is straightforward. If you are fine-tuning a vision-language model for agentic tasks and using GRPO or a similar group-based RL method, your training loop is likely suppressing tool-use learning on a large fraction of your data. You are not getting signal from the rollouts that fail specifically because of bad tool calls. AXPO's fix is targeted: resample the tool call continuation when the whole group fails, rather than discarding or averaging over those rollouts.

If you are training your own agentic VLM today, audit whether your RL groups are going all-wrong on tool-using rollouts. That diagnostic alone tells you whether this gap is costing you. If it is, the AXPO approach gives you a concrete intervention to apply before scaling up compute or model size.
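One way to run that audit is a small script over logged rollouts. The sketch below assumes each group is available as a list of `(uses_tool, correct)` pairs; that log format and the name `audit_groups` are hypothetical, so adapt them to whatever your training loop records.

```python
def audit_groups(groups):
    """Summarize tool-use health over groups of (uses_tool, correct) pairs."""
    total_rollouts = sum(len(g) for g in groups)
    tool_rollouts = sum(1 for g in groups for uses_tool, _ in g if uses_tool)
    groups_with_tool = 0
    all_wrong_groups = 0
    for g in groups:
        tool_outcomes = [correct for uses_tool, correct in g if uses_tool]
        if tool_outcomes:
            groups_with_tool += 1
            # No correct tool-using rollout means no positive example to learn from.
            if not any(tool_outcomes):
                all_wrong_groups += 1
    return {
        "tool_attempt_rate": tool_rollouts / total_rollouts if total_rollouts else 0.0,
        "all_wrong_tool_rate": all_wrong_groups / groups_with_tool if groups_with_tool else 0.0,
    }

# Three toy groups of four rollouts each.
groups = [
    [(True, False), (True, False), (False, True), (False, True)],   # tool subgroup all wrong
    [(True, True), (False, False), (False, False), (False, True)],  # one good tool call
    [(False, True), (False, True), (False, True), (False, True)],   # no tool use at all
]
stats = audit_groups(groups)
```

Here `tool_attempt_rate` is the share of rollouts that call a tool, and `all_wrong_tool_rate` is the share of tool-attempting groups in which every tool-using rollout failed. Comparing both against the roughly 30% and 40% figures reported above gives a quick sense of whether the gap applies to your setup.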