May 24, 2026

research

Better Token Credit Assignment Closes the RLVR Reasoning Gap

A new method called DelTA reshapes how reinforcement learning updates propagate to individual tokens during LLM training, improving average math-benchmark scores by up to 3.26 points over the strongest same-scale baselines. Engineers building reasoning-focused models now have a concrete technique to keep high-frequency formatting tokens from polluting gradient updates.

Reinforcement learning from verifiable rewards (RLVR) has become the go-to technique for improving reasoning in large language models. But there is a persistent blind spot: nobody has had a clear picture of how a sequence-level reward actually changes individual token probabilities. A new paper makes that mechanism explicit and uses the insight to build a better training method.

The core finding is that the policy-gradient update in standard RLVR acts as a linear discriminator over token-gradient vectors. The update implicitly constructs two centroids, one from positively-rewarded responses and one from negatively-rewarded responses, via advantage-weighted averaging. The direction between those centroids determines which token probabilities go up and which go down.
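To make the centroid picture concrete, here is a minimal NumPy sketch. It assumes each token has a small gradient vector and inherits its response's advantage; the function name `centroid_update`, the toy dimensions, and the random data are all illustrative assumptions, not taken from the paper. It shows that the advantage-weighted sum of token gradients splits into a positive centroid minus a negative centroid, each scaled by its advantage mass.

```python
import numpy as np

def centroid_update(token_grads, advantages):
    """Split an advantage-weighted gradient sum into two centroids.

    token_grads: (n_tokens, dim) array, one toy gradient vector per token.
    advantages:  (n_tokens,) array; each token inherits its response's advantage.
    Assumes at least one positive and one negative advantage.
    """
    pos = advantages > 0
    neg = advantages < 0
    z_pos = advantages[pos].sum()        # total positive advantage mass
    z_neg = -advantages[neg].sum()       # total negative advantage mass (as a positive number)
    # Advantage-weighted averages of the token gradients on each side.
    c_pos = (advantages[pos, None] * token_grads[pos]).sum(axis=0) / z_pos
    c_neg = (-advantages[neg, None] * token_grads[neg]).sum(axis=0) / z_neg
    update = z_pos * c_pos - z_neg * c_neg
    return c_pos, c_neg, update

rng = np.random.default_rng(0)
token_grads = rng.normal(size=(6, 4))                       # toy data, not real gradients
advantages = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])    # balanced advantage mass

c_pos, c_neg, update = centroid_update(token_grads, advantages)

# The centroid form equals the plain advantage-weighted sum of token gradients.
direct = (advantages[:, None] * token_grads).sum(axis=0)

# Linear score per token: positive means the update pushes along that token's gradient.
scores = token_grads @ update
```

When the positive and negative advantage masses are equal, as in this toy batch, the update is simply proportional to `c_pos - c_neg`, and the sign of each entry in `scores` indicates which side of the linear boundary a token's gradient falls on.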

The problem is centroid contamination. High-frequency tokens (think formatting characters and structural boilerplate) dominate both centroids equally. Those shared patterns dilute the directions that actually separate good responses from bad ones. The signal you care about gets washed out by noise you do not.
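A toy example can show how lopsided this gets. The sketch below assumes a made-up batch in which nine of every ten tokens on each side share one "formatting" gradient direction and only one token carries a side-specific direction; the vectors and the 9:1 ratio are illustrative assumptions, not measurements from the paper.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical gradient directions for a 3-dimensional toy model.
shared = np.array([1.0, 0.0, 0.0])    # formatting / boilerplate direction
good_dir = np.array([0.0, 1.0, 0.0])  # direction specific to rewarded responses
bad_dir = np.array([0.0, 0.0, 1.0])   # direction specific to penalized responses

# Nine shared tokens plus one side-specific token on each side.
pos_tokens = np.vstack([np.tile(shared, (9, 1)), good_dir])
neg_tokens = np.vstack([np.tile(shared, (9, 1)), bad_dir])

# With uniform advantage magnitudes the centroids are plain means.
c_pos = pos_tokens.mean(axis=0)
c_neg = neg_tokens.mean(axis=0)

centroid_cosine = cosine(c_pos, c_neg)   # how alike the two centroids look
separation = c_pos - c_neg               # the part that tells the sides apart
signal_ratio = float(np.linalg.norm(separation) / np.linalg.norm(c_pos))
```

Here the two centroids are almost parallel (cosine around 0.99), and the separating component is only about 16% of a centroid's length. In this idealized batch the shared part cancels exactly in the difference; in a real batch it presumably would not cancel so cleanly, which is where a small separating signal becomes easy to swamp.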

DelTA (Discriminative Token Credit Assignment) fixes this directly. It estimates per-token coefficients that amplify side-specific gradient directions and downweight shared or weakly discriminative ones. Those coefficients reweight a self-normalized RLVR surrogate, making the effective centroids more contrastive without changing the underlying reward signal.
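The article does not spell out how DelTA estimates its coefficients, so the sketch below uses a stand-in heuristic to illustrate the general idea only: each token is weighted by how well its gradient aligns with the direction from the opposite centroid to its own, and the weights are self-normalized within each side. The function names, the cosine-based rule, and the `floor` parameter are hypothetical assumptions, not the paper's estimator.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    """Cosine similarity, with a small eps to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def side_centroid(grads, weights):
    """Self-normalized weighted average: weights are divided by their own sum."""
    return (weights[:, None] * grads).sum(axis=0) / weights.sum()

def discriminative_coeffs(grads, own, other, floor=0.05, eps=1e-12):
    """HYPOTHETICAL per-token coefficients (not the paper's estimator).

    Tokens whose gradient points from the other centroid toward their own
    centroid get a larger weight; shared or opposing tokens fall to `floor`.
    """
    direction = own - other
    norms = np.linalg.norm(grads, axis=1) * np.linalg.norm(direction) + eps
    alignment = grads @ direction / norms
    return floor + np.clip(alignment, 0.0, None)

# Toy batch (illustrative): nine shared "formatting" tokens and one
# side-specific token on each side.
shared = np.array([1.0, 0.0, 0.0])
pos_grads = np.vstack([np.tile(shared, (9, 1)), [0.0, 1.0, 0.0]])
neg_grads = np.vstack([np.tile(shared, (9, 1)), [0.0, 0.0, 1.0]])
pos_adv = np.ones(10)   # advantage magnitudes; the reward signal is left untouched
neg_adv = np.ones(10)

# Baseline centroids use advantage magnitudes only.
c_pos = side_centroid(pos_grads, pos_adv)
c_neg = side_centroid(neg_grads, neg_adv)

# Reweighted centroids multiply each token's advantage magnitude by its coefficient.
w_pos = discriminative_coeffs(pos_grads, c_pos, c_neg)
w_neg = discriminative_coeffs(neg_grads, c_neg, c_pos)
c_pos_rw = side_centroid(pos_grads, w_pos * pos_adv)
c_neg_rw = side_centroid(neg_grads, w_neg * neg_adv)

cos_before = cosine(c_pos, c_neg)
cos_after = cosine(c_pos_rw, c_neg_rw)
```

On this toy batch the shared tokens drop to the floor weight while the side-specific tokens are amplified, so the reweighted centroids are far less parallel than the originals (`cos_after` is well below `cos_before`). The advantages themselves are never modified, which mirrors the claim that only the surrogate's token weighting changes.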

The numbers are concrete. On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 average points on Qwen3-8B-Base and 2.62 average points on Qwen3-14B-Base. The method also generalizes beyond math: additional evaluations cover code generation, a different model backbone, and out-of-domain tasks.

For product engineers, the practical takeaway is about where training compute goes. If your RLVR pipeline is being dragged down by formatting tokens soaking up gradient updates, better centroid construction is a lever worth pulling. DelTA operates at the surrogate reweighting level, which means it slots into existing RLVR setups rather than requiring a new reward model or a different rollout strategy.

If you are training or fine-tuning reasoning models today, read the discriminator framing carefully. Understanding which token directions your update is actually optimizing is the kind of diagnostic lens that changes how you interpret training curves and debug regressions. Start there, then evaluate whether token-level coefficient estimation is worth adding to your pipeline.