June 7, 2026

vLLM 0.22.1 Fixes Multi-Node Serving and Adds AMD Zen Acceleration

vLLM v0.22.1 ships targeted bug fixes for multi-node Ray serving hangs, DeepSeek-V4 initialization failures, and several model-loading regressions, plus new support for JetBrains' Mellum v2 and zentorch-accelerated inference on AMD Zen CPUs.

vLLM v0.22.1 is a patch release on top of v0.22.0. It comes from six contributors, including one first-time contributor, and delivers fixes that unblock production deployments on multi-node and CPU-based setups.

The most operationally critical fix targets a deterministic hang in multi-node Ray data-parallel serving when num_api_servers > 1. The root cause was that the Ray DP backend was being pulled into the deferred, kernel-assigned port allocation introduced in an earlier change. The fix excludes the Ray DP backend from that path. If you run Ray-based multi-node serving with more than one API server, this hang was not intermittent; it was guaranteed. Update now.
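For readers unfamiliar with the term, "kernel-assigned" port allocation usually means binding a socket to port 0 and letting the operating system pick a free port, so the port number is only known once the bind has happened. The sketch below illustrates that idea and a per-backend opt-out in plain Python. It is not vLLM's code: the backend names, `DEFERRED_PORT_BACKENDS`, and `plan_port` are hypothetical and exist only for illustration.

```python
import socket
from typing import Optional, Tuple

# Hypothetical backend names for illustration only; not vLLM identifiers.
# "ray" is deliberately absent, mirroring the idea of excluding it from the deferred path.
DEFERRED_PORT_BACKENDS = {"mp"}


def bind_kernel_assigned(host: str = "127.0.0.1") -> Tuple[socket.socket, int]:
    """Bind to port 0 so the OS chooses a free port, then read the port back."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the kernel to pick any free port
    return sock, sock.getsockname()[1]


def plan_port(backend: str, configured_port: int) -> Optional[int]:
    """Return a port known up front, or None if it is deferred until bind time."""
    if backend in DEFERRED_PORT_BACKENDS:
        return None  # the real port is only discovered after binding
    return configured_port


if __name__ == "__main__":
    sock, port = bind_kernel_assigned()
    print("kernel picked port", port)
    sock.close()
    print("ray plan:", plan_port("ray", 8000))
    print("mp plan:", plan_port("mp", 8000))
```

The point of the sketch is the asymmetry: a deferred port cannot be handed to other processes before the bind happens, while a backend kept out of the deferred set has its port available from the start.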

On the hardware side, AMD Zen CPU users get a meaningful upgrade. W8A8 (int8 dynamic-symmetric) and W4A16 (GPTQ) linear inference now routes through zentorch kernels. These are registered ahead of the generic oneDNN CPU kernels, with transparent fallback on non-Zen CPUs, GPUs, and XPU. First-time contributor @aadwived landed this work in PR #41813. If you are running quantized inference on AMD Zen hardware, this is a direct performance path you were not getting before.
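The "registered ahead of ... with transparent fallback" behavior can be pictured as an ordered list of kernel candidates where the first one that supports the current device and quantization scheme wins. The sketch below is a generic illustration of that pattern, not vLLM's kernel registry: `KernelCandidate`, `select_kernel`, and the device and scheme strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class KernelCandidate:
    name: str
    # Hypothetical predicate: (device, quant_scheme) -> whether this kernel applies.
    is_supported: Callable[[str, str], bool]


def select_kernel(candidates: List[KernelCandidate], device: str, scheme: str) -> str:
    """Return the first candidate, in registration order, that supports the request."""
    for candidate in candidates:
        if candidate.is_supported(device, scheme):
            return candidate.name
    raise RuntimeError(f"no kernel for device={device!r}, scheme={scheme!r}")


# Order matters: the Zen-specific candidate is listed before the generic CPU one.
CANDIDATES = [
    KernelCandidate(
        "zentorch",
        lambda d, s: d == "cpu-zen" and s in {"w8a8-int8", "w4a16-gptq"},
    ),
    KernelCandidate("onednn", lambda d, s: d.startswith("cpu")),
    KernelCandidate("platform_default", lambda d, s: True),  # catch-all fallback
]

if __name__ == "__main__":
    for device in ("cpu-zen", "cpu-generic", "cuda"):
        print(device, "->", select_kernel(CANDIDATES, device, "w8a8-int8"))
```

Because selection is by order and capability, callers on other hardware need no configuration change; they simply fall through to the next candidate that applies.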

Two model-loading regressions are also resolved. OlmoHybridForCausalLM was failing to initialize after a checkpoint change switched rope_parameters from None to {"rope_type": None}. HyperCLOVAX broke when the upstream HuggingFace repo removed its remote code. The fix for HyperCLOVAX registers the hyperclovax model_type so vLLM uses its vendored config rather than the stale auto_map, and it requires transformers >= 5.9.0. DeepSeek-V4 gets a separate fix for a CUTLASS fmin compatibility issue that blocked initialization entirely.
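The rope_parameters change is a common kind of pitfall: code that only tests for None treats {"rope_type": None} as a real configuration. The sketch below shows one defensive way to handle both shapes. It illustrates the general pattern under that assumption and is not the actual vLLM patch; `rope_is_configured` is a hypothetical helper.

```python
from typing import Any, Dict, Optional


def rope_is_configured(rope_parameters: Optional[Dict[str, Any]]) -> bool:
    """Treat None and {"rope_type": None} alike as 'no RoPE type configured'."""
    if rope_parameters is None:
        return False
    # A dict whose rope_type is missing or None carries no usable RoPE type.
    return rope_parameters.get("rope_type") is not None


if __name__ == "__main__":
    for value in (None, {"rope_type": None}, {"rope_type": "default"}):
        print(value, "->", rope_is_configured(value))
```

A check written as `rope_parameters is not None` would return True for the second case, which is the kind of mismatch that can surface as an initialization failure.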

New model support lands for JetBrains' Mellum v2, an open-weights Mixture-of-Experts code-generation model (PR #43992). If you are building coding tools and want a locally-hosted MoE option, this is now a supported path in vLLM.

Two build and CI fixes round out the release. Docker image builds were broken because flashinfer-jit-cache was being installed via --extra-index-url while the package is quarantined on PyPI. That install step is now removed. NIXL KV-connector wheel installs are also normalized so only the wheel matching the image's CUDA major version is kept, fixing an ImportError: libcudart.so.12 that appeared on CUDA 13 images.
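Keeping only the wheel that matches the image's CUDA major version amounts to a filter over candidate wheel files. The sketch below shows that idea under a stated assumption: that wheel filenames carry a `cu<major>` tag preceded by `-` or `_`. The filenames and the `keep_matching_cuda_wheels` helper are hypothetical and do not reflect the real NIXL wheel names or the actual build script.

```python
import re
from typing import List

# Assumption: a tag such as "_cu12" or "-cu13" encodes the CUDA major version.
_CUDA_TAG = re.compile(r"[-_]cu(\d+)")


def keep_matching_cuda_wheels(wheels: List[str], cuda_major: int) -> List[str]:
    """Keep wheels tagged for cuda_major, plus wheels with no CUDA tag at all."""
    kept = []
    for name in wheels:
        match = _CUDA_TAG.search(name)
        if match is None or int(match.group(1)) == cuda_major:
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Hypothetical filenames for illustration only.
    candidates = [
        "example_connector_cu12-1.0-py3-none-any.whl",
        "example_connector_cu13-1.0-py3-none-any.whl",
        "example_helper-1.0-py3-none-any.whl",
    ]
    print(keep_matching_cuda_wheels(candidates, 13))
```

Dropping the mismatched wheel is what prevents a CUDA 13 image from importing a binary linked against libcudart.so.12.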

What to do today: if you run multi-node Ray serving with num_api_servers > 1, upgrade to 0.22.1 immediately to clear the deterministic hang. If you deploy on AMD Zen CPUs with quantized models, test the zentorch kernel path this week. Check your CUDA version against your NIXL wheel if you are on CUDA 13 images.
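To confirm whether a given environment already includes these fixes, a small version comparison is enough. The sketch below parses the leading numeric part of a version string and compares it with 0.22.1. The helper names are hypothetical, and treating pre-release or local suffixes as their base version is a simplification.

```python
import re
from typing import Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Extract the leading dotted-number part, e.g. '0.22.1rc1' -> (0, 22, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def includes_fixes(version: str, minimum: Tuple[int, ...] = (0, 22, 1)) -> bool:
    """Return True if the version is at or above the minimum release."""
    return parse_release(version) >= minimum


if __name__ == "__main__":
    for candidate in ("0.22.0", "0.22.1", "0.23.0"):
        print(candidate, "->", includes_fixes(candidate))
```

In a live environment, the installed version string can be read with `importlib.metadata.version("vllm")` from the standard library and passed to `includes_fixes`.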