May 28, 2026

vLLM 0.21.0 Brings Blackwell Support and Smarter KV Offloading

vLLM v0.21.0 ships KV offloading integrated with the Hybrid Memory Allocator, a new attention backend for Blackwell GPUs, and speculative decoding that respects reasoning budgets. Two breaking changes require immediate attention from teams building on vLLM.

vLLM v0.21.0 arrives with 367 commits from 202 contributors, 49 of them new to the project. Two changes break existing setups before you even get to the new features.

Breaking changes first. vLLM now requires a C++20-compatible compiler, driven by compatibility requirements with PyTorch. If your build pipeline uses an older compiler, builds from source will fail. Separately, support for transformers v4 is formally deprecated. Migration to transformers v5 is no longer optional; it is the direction the project has committed to.
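Both requirements can be checked before an upgrade. The following is a minimal sketch, not an official vLLM tool. It assumes a GCC- or Clang-style compiler that accepts -std=c++20 and -fsyntax-only, and the helper names (compiler_supports_cxx20, major_version, transformers_major) are hypothetical, introduced here for illustration.

```python
import shutil
import subprocess
import tempfile
from importlib import metadata
from pathlib import Path
from typing import Optional

# Concepts are a C++20 feature, so this probe is rejected under -std=c++17.
CXX20_PROBE = """
#include <concepts>
template <std::integral T> T twice(T x) { return x + x; }
int main() { return twice(0); }
"""


def compiler_supports_cxx20(compiler: str = "c++") -> bool:
    """Return True if `compiler` accepts a small C++20-only program.

    Assumes GCC/Clang-style flags; MSVC uses different options.
    """
    path = shutil.which(compiler)
    if path is None:
        return False  # compiler not on PATH
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "probe.cpp"
        src.write_text(CXX20_PROBE)
        result = subprocess.run(
            [path, "-std=c++20", "-fsyntax-only", str(src)],
            capture_output=True,
        )
    return result.returncode == 0


def major_version(version: str) -> int:
    """Extract the major component from a version string such as '5.0.1'."""
    return int(version.split(".")[0])


def transformers_major() -> Optional[int]:
    """Major version of the installed transformers package, or None if absent."""
    try:
        return major_version(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("C++20 compiler available:", compiler_supports_cxx20())
    print("transformers major version:", transformers_major())
```

Running the script in the build container gives a quick yes/no on the compiler and shows whether the environment is still on transformers v4.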

KV offloading gets smarter. The KV offloading subsystem now integrates with the Hybrid Memory Allocator (HMA). This release adds scheduler-side sliding window group support, full HMA enablement, multi-connector HMA, and a MooncakeStoreConnector for distributed KV offloading. For teams running large-context workloads where KV cache pressure is a real bottleneck, this is the most impactful engine change in this release.

Blackwell gets a dedicated attention backend. A new TOKENSPEED_MLA attention backend is now available specifically for DeepSeek-R1 and Kimi-K25 prefill and decode on Blackwell GPUs. If you are deploying either model on Blackwell hardware, this backend is worth switching to immediately.
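Selecting the backend could look like the sketch below, which only assembles a launch command and does not start a server. Two things here are assumptions to verify against the v0.21.0 documentation: that the backend is chosen through the --attention-backend option of vllm serve, and that the value is spelled exactly TOKENSPEED_MLA. The model identifier and the build_serve_command helper are illustrative.

```python
import shlex
from typing import List


def build_serve_command(model: str, backend: str) -> List[str]:
    """Assemble a `vllm serve` argv list with an explicit attention backend.

    Assumption: the backend is selected with `--attention-backend`;
    confirm the flag and the accepted values for your vLLM version.
    """
    return ["vllm", "serve", model, "--attention-backend", backend]


if __name__ == "__main__":
    cmd = build_serve_command("deepseek-ai/DeepSeek-R1", "TOKENSPEED_MLA")
    # Print the command for review instead of executing it.
    print(shlex.join(cmd))
```

Keeping the backend as an explicit parameter makes it easy to benchmark the new backend against the default one on the same hardware before switching production traffic.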

Speculative decoding now respects thinking budgets. Reasoning models that use speculative decoding previously had no mechanism to honor reasoning or thinking budget constraints. That gap is closed. Spec decode now respects those budgets, which matters for anyone running reasoning models in production where latency and token budgets are tightly controlled.

Model coverage expands significantly. New architectures include MiMo-V2.5, Laguna XS.2, Moondream3, Qianfan-OCR, Cohere MoE, and Cohere Eagle. EAGLE speculative decoding support lands for Mistral. DeepSeek V4 picks up AMD/ROCm support, pipeline parallelism, and disaggregated serving fixes. Gemma3 and Gemma4 receive MoE fixes, pipeline parallelism corrections, and tool parser crash fixes. Qwen2.5-VL gains CUDA graph support for its ViT component.

What to do today. Start by auditing your build environment for C++20 compiler support and pin your transformers dependency to v5. If you run DeepSeek-R1 or Kimi-K25 on Blackwell, enable the TOKENSPEED_MLA backend. If KV cache pressure limits your throughput, the HMA integration is worth benchmarking against your current allocator setup. Teams using speculative decoding with reasoning models should re-test latency profiles now that thinking budget enforcement is active.