May 27, 2026

EAGLE 3.1 Fixes the Drift That Breaks Speculative Decoding in Production

EAGLE 3.1 tackles attention drift in speculative decoding with two architectural fixes, delivering up to 2x longer acceptance length on long-context workloads. The update ships with training support via TorchSpec and native vLLM integration.

Speculative decoding works well in benchmarks and breaks in production. Different chat templates, long-context inputs, out-of-distribution system prompts: each one chips away at acceptance length. The EAGLE team traced the root cause and shipped a fix.

The problem is called attention drift. As speculation depth increases, the drafter gradually shifts attention away from sink tokens and toward its own generated tokens. Two issues drive this. First, higher-layer hidden states dominate the fused input representation, making it increasingly imbalanced. Second, hidden-state magnitude grows across speculation steps because of an unnormalized residual path. The deeper you speculate, the less stable the drafter becomes.
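The second issue, magnitude growth along an unnormalized residual path, can be illustrated with a toy model. The sketch below is not EAGLE's drafter: `toy_residual_step` is a hypothetical stand-in whose branch output is always orthogonal to its input, chosen only because it makes the growth exact and easy to check. The `post_norm` flag assumes a plain RMS normalization with no learned gain.

```python
import math

def rms(v):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(x * x for x in v) / len(v))

def rms_norm(v, eps=1e-6):
    """Rescale v so its RMS is approximately 1 (no learned gain; an assumption)."""
    scale = 1.0 / math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x * scale for x in v]

def toy_residual_step(h, gain=0.5):
    """Hypothetical residual update h + f(h).

    f rotates each (x, y) pair by 90 degrees and scales it by `gain`, so f(h)
    is orthogonal to h and every step multiplies the norm by sqrt(1 + gain**2).
    """
    out = []
    for i in range(0, len(h), 2):
        x, y = h[i], h[i + 1]
        out.extend([x - gain * y, y + gain * x])
    return out

def magnitudes(h, steps, post_norm):
    """RMS of the state handed to each successive speculation step."""
    history = []
    for _ in range(steps):
        h = toy_residual_step(h)
        if post_norm:
            h = rms_norm(h)  # normalize before feeding the next step
        history.append(rms(h))
    return history

h0 = [1.0, -2.0, 0.5, 3.0]
raw = magnitudes(h0, steps=6, post_norm=False)    # grows every step
normed = magnitudes(h0, steps=6, post_norm=True)  # stays near 1
```

In this toy setting the unnormalized magnitude compounds geometrically with depth, while the normalized state keeps roughly the same scale at every step. A real drafter's growth rate depends on its weights and inputs.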

EAGLE 3.1 fixes this with two architectural changes. First, FC normalization is applied to each target hidden state before it enters the FC layer. Second, post-norm hidden states are fed into the next decoding step. The intent of the post-norm change is that the drafter behaves more like a model recursively invoking itself across steps, rather than like extra layers stacked onto the target model. That distinction matters for stability at depth.
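A minimal sketch of the first change, assuming three target-layer hidden states, an RMS-style normalization without learned gain, and a plain linear FC layer. None of these specifics come from the release; the names `energy_shares` and `fuse` and all the numbers are illustrative. It shows why normalizing each state before fusion keeps one layer from dominating the concatenated input.

```python
import math

def rms_norm(v, eps=1e-6):
    """Rescale v so its RMS is approximately 1 (no learned gain; an assumption)."""
    mean_sq = sum(x * x for x in v) / len(v)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [x * scale for x in v]

def energy_shares(states):
    """Fraction of the concatenated vector's squared norm contributed by each state."""
    energies = [sum(x * x for x in s) for s in states]
    total = sum(energies)
    return [e / total for e in energies]

def fuse(states, fc_weights, normalize):
    """Concatenate per-layer states (optionally normalized) and apply a linear FC layer."""
    if normalize:
        states = [rms_norm(s) for s in states]
    concat = [x for s in states for x in s]
    return [sum(w * x for w, x in zip(row, concat)) for row in fc_weights]

# Illustrative hidden states whose magnitude grows with layer depth.
low = [0.1, -0.2, 0.1, 0.2]
mid = [1.0, -1.0, 0.5, 2.0]
high = [10.0, -20.0, 5.0, 30.0]
states = [low, mid, high]

raw_shares = energy_shares(states)                            # high layer dominates
normed_shares = energy_shares([rms_norm(s) for s in states])  # roughly equal thirds

# Hypothetical FC weights: 2 output features over the 12 concatenated inputs.
fc_weights = [[0.1] * 12, [-0.1] * 12]
fused = fuse(states, fc_weights, normalize=True)
```

Without normalization, almost all of the fused input's energy comes from the highest layer in this example. With it, each layer contributes about a third, so the FC layer sees a balanced representation.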

Compared with EAGLE 3, the 3.1 release delivers better training-to-inference extrapolation, stronger long-context robustness, higher resilience to chat template and system prompt variation, and more stable acceptance length across diverse serving environments. On long-context workloads specifically, acceptance length is up to 2x longer than with EAGLE 3.

Two ecosystem integrations ship alongside the architecture update. TorchSpec now provides training support for EAGLE 3.1, lowering the overhead of training new drafter models. The vLLM team has integrated EAGLE 3.1 directly into the serving stack. Both are the result of a joint effort across the EAGLE team, the vLLM team, and TorchSpec, coordinated as an open-source collaboration.

For product engineers running inference in production: if you are already using EAGLE 2 or EAGLE 3 and seeing acceptance length degrade on long contexts or with varied system prompts, this update directly addresses that failure mode. The practical step is to migrate to EAGLE 3.1 drafters and test acceptance length on your actual prompt distribution, not just on a controlled eval set. If you need to train a custom drafter, TorchSpec now supports that with lower overhead.
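One way to run that comparison is to bucket your own traffic and compute acceptance length per bucket. The sketch below assumes acceptance length means the average number of tokens emitted per target verification step (accepted draft tokens plus one token from the target model), and that you can log the accepted-draft count for each step. The record format and bucket names are hypothetical and are not taken from vLLM or TorchSpec.

```python
from collections import defaultdict
from statistics import mean

def acceptance_length(accepted_per_step):
    """Mean tokens emitted per verification step: accepted draft tokens + 1.

    Raises statistics.StatisticsError if given no steps.
    """
    return mean(n + 1 for n in accepted_per_step)

def acceptance_by_bucket(records):
    """Group (bucket, accepted_per_step) records and score each bucket separately."""
    buckets = defaultdict(list)
    for bucket, accepted_per_step in records:
        buckets[bucket].extend(accepted_per_step)
    return {name: acceptance_length(steps) for name, steps in buckets.items()}

# Illustrative logs: accepted draft tokens per verification step, per request.
records = [
    ("short_context", [3, 2, 3]),
    ("long_context", [1, 0, 1, 0]),
]
report = acceptance_by_bucket(records)
```

Running the same buckets against an EAGLE 3 drafter and an EAGLE 3.1 drafter shows where the gap opens on your workload. The long-context and unusual-system-prompt buckets are the ones to watch.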