May 24, 2026

vLLM 0.21 Brings HMA, Spec Decode Budgets, and a Build Break

vLLM v0.21.0 lands KV offloading with Hybrid Memory Allocator integration, speculative decoding support for reasoning budgets, and a C++20 build requirement that will break source builds on older toolchains. Here is what teams running inference infrastructure need to act on now.

vLLM v0.21.0 ships 367 commits from 202 contributors, 49 of them new. Several changes demand immediate attention from anyone running or building on vLLM.

The build requirement changed. vLLM now requires a C++20-compatible compiler. This is a breaking build change, driven by PyTorch compatibility. If your CI pipeline builds vLLM from source with a compiler that lacks C++20 support, the build will fail until you update your toolchain.
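One quick way to check a build host is to compile a tiny C++20 program. The sketch below is not part of vLLM. The function name supports_cxx20 and the probe source are illustrative, and it assumes a GCC- or Clang-style driver that accepts -std=c++20 and -fsyntax-only. It says nothing about the CUDA toolchain or about the minimum compiler versions vLLM itself is tested against.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Tiny translation unit that needs C++20: it uses a standard-library concept.
PROBE_SOURCE = """
#include <concepts>
template <std::integral T>
T twice(T x) { return x + x; }
int main() { return twice(0); }
"""


def supports_cxx20(compiler: str = "c++") -> bool:
    """Return True if `compiler` accepts a small C++20 program.

    Assumes GCC/Clang-style flags; returns False if the compiler is not on PATH.
    """
    path = shutil.which(compiler)
    if path is None:
        return False
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "probe.cpp"
        src.write_text(PROBE_SOURCE)
        # -fsyntax-only parses and type-checks without producing a binary.
        result = subprocess.run(
            [path, "-std=c++20", "-fsyntax-only", str(src)],
            capture_output=True,
            text=True,
        )
    return result.returncode == 0
```

Calling supports_cxx20("g++") or supports_cxx20("clang++") in a CI pre-check step would flag an outdated toolchain before the full vLLM build starts.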

Transformers v4 is deprecated. This release formally deprecates support for transformers v4, and transformers v5 is now the migration target. Teams pinned to v4 in their environments need to plan that upgrade before a later release makes it a hard block.
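To audit an environment, it can help to read the installed transformers version directly. The following is a minimal sketch using the standard library's importlib.metadata. The helper names are hypothetical, and the version parsing is deliberately simple: it looks only at the leading major number.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '4.57.1'."""
    return int(version_string.split(".")[0])


def needs_v5_migration(version_string: str) -> bool:
    """True if the given transformers version is older than the v5 line."""
    return major_version(version_string) < 5


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is absent."""
    try:
        return version("transformers")
    except PackageNotFoundError:
        return None
```

A CI job could call installed_transformers_version() and fail early, or just warn, when needs_v5_migration returns True for the result.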

KV offloading got a significant upgrade. The KV offloading subsystem now integrates with the Hybrid Memory Allocator (HMA). The changes include scheduler-side sliding window group support, full HMA enablement, multi-connector HMA, and a MooncakeStoreConnector for distributed KV offloading. For teams running large-context workloads where GPU memory is the bottleneck, this is the most operationally relevant change in the release.
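Some back-of-the-envelope arithmetic shows why sliding-window layers matter for KV memory at long context. This is not vLLM's allocator logic, only the standard KV-cache size estimate, and the model dimensions are hypothetical values chosen for illustration. A full-attention layer caches keys and values for every token, while a sliding-window layer caches at most one window of tokens.

```python
def kv_cache_bytes(
    seq_len: int,
    full_layers: int,
    sliding_layers: int,
    window: int,
    kv_heads: int,
    head_dim: int,
    dtype_bytes: int = 2,
) -> int:
    """Estimate KV-cache bytes for one sequence in a hybrid-attention model."""
    # Keys and values (factor of 2) for one token in one layer.
    per_token_per_layer = 2 * kv_heads * head_dim * dtype_bytes
    # Full-attention layers keep every token; sliding layers keep one window.
    cached_token_layers = (
        full_layers * seq_len + sliding_layers * min(seq_len, window)
    )
    return per_token_per_layer * cached_token_layers


# Hypothetical 48-layer model: 8 KV heads, head_dim 128, 16-bit cache,
# 128K-token context, 4096-token sliding window.
HYBRID = kv_cache_bytes(131072, 8, 40, 4096, kv_heads=8, head_dim=128)
ALL_FULL = kv_cache_bytes(131072, 48, 0, 4096, kv_heads=8, head_dim=128)
```

Under these assumed dimensions, the all-full-attention layout needs 24 GiB per sequence, while the hybrid layout needs about 4.6 GiB. A gap of that size is why an allocator and offloading path that understand sliding-window groups can change capacity planning.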

Speculative decoding now respects thinking budgets. Reasoning models that use a thinking or reasoning budget were previously incompatible with spec decode. Spec decode now handles those budgets correctly, so you can layer its latency gains on top of reasoning models without correctness issues.
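One plausible reason the two features interact is that speculative decoding can accept several draft tokens in a single step, so a budget check that assumes one token per step can overshoot. The toy sketch below illustrates that idea only. It is an assumption about the general problem, not vLLM's implementation, and the function and parameter names are hypothetical.

```python
from typing import List


def clamp_accepted_to_budget(
    accepted_tokens: List[int],
    thinking_tokens_used: int,
    thinking_budget: int,
) -> List[int]:
    """Keep only as many accepted draft tokens as the thinking budget allows."""
    # Never negative, even if the budget has already been exceeded.
    remaining = max(thinking_budget - thinking_tokens_used, 0)
    # Accepted tokens beyond the remaining budget are dropped.
    return accepted_tokens[:remaining]
```

With a budget of 10 tokens and 8 already used, a step that accepts four draft tokens would keep only the first two under this scheme.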

Blackwell GPU users get a new attention backend. A TOKENSPEED_MLA backend is now available specifically for DeepSeek-R1 and Kimi-K2.5 prefill and decode on Blackwell GPUs.

On the model side, new architectures include MiMo-V2.5, Laguna XS.2, Moondream3, Qianfan-OCR, Cohere MoE, and Cohere Eagle. EAGLE speculative decoding support expands to Mistral, Gemma4, MiMo-V2.5, and Cohere Eagle. DeepSeek V4 adds AMD/ROCm support and pipeline parallelism. Gemma3 and Gemma4 receive several MoE fixes and a tool parser crash fix.

For Model Runner V2, Qwen3.5/Mamba hybrid models are now supported, and ViT CUDA graph support lands for Qwen2.5-VL.

What to do today: Confirm that your compiler supports C++20 before building this release from source. Audit your transformers pin and start the v5 migration if you have not already. If you are running KV offloading, review the HMA documentation to see whether the new allocator changes your memory configuration defaults. And if you want spec decode with a reasoning model, this release removes the blocker that forced you to choose between the two features.