Interactive world models are advancing fast, but there has been no unified benchmark to tell builders which models actually hold up under realistic, multi-turn conditions. WBench fills that gap.
The benchmark covers five evaluation dimensions: video quality, setting adherence, interaction adherence, consistency, and physics compliance. Those dimensions map directly to what users notice when a world model breaks down: a scene shifts style mid-session, a character ignores an input, or a physics interaction looks wrong.
The dataset is concrete: 289 test cases and 1,058 interaction turns, or roughly 3.7 turns per case on average. Each case defines a world setting and a multi-turn interaction sequence. Scenes span diverse environments, visual styles, subjects, and both first- and third-person perspectives.
Four interaction types are covered: navigation, subject action, event editing, and perspective switching. Navigation gets special treatment. WBench unifies text, 6-DoF pose, and discrete-action control into a single evaluation framework. That means models with very different native input interfaces can all be tested on the same navigation tasks without forcing one interface onto every model.
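To make the case structure and the unified navigation idea concrete, here is a minimal Python sketch. It is not WBench's actual schema: the class names, field names, the 6-DoF tuple layout, and the `control_for` helper are all assumptions made for illustration. It only shows how one navigation turn could carry text, pose, and discrete-action forms side by side, so that each model reads the form its native interface accepts.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple, Union


class InteractionType(Enum):
    """The four interaction types named in the text."""
    NAVIGATION = "navigation"
    SUBJECT_ACTION = "subject_action"
    EVENT_EDITING = "event_editing"
    PERSPECTIVE_SWITCHING = "perspective_switching"


# Hypothetical 6-DoF layout: (x, y, z, roll, pitch, yaw).
Pose = Tuple[float, float, float, float, float, float]


@dataclass
class Turn:
    kind: InteractionType
    text: str                              # natural-language form of the turn
    pose: Optional[Pose] = None            # only meaningful for navigation
    discrete_action: Optional[str] = None  # e.g. "forward"; navigation only

    def control_for(self, interface: str) -> Union[str, Pose]:
        """Return this turn in the form a model's native interface accepts."""
        if interface == "text":
            return self.text
        if self.kind is not InteractionType.NAVIGATION:
            raise ValueError(f"{interface!r} control only applies to navigation turns")
        if interface == "pose" and self.pose is not None:
            return self.pose
        if interface == "discrete" and self.discrete_action is not None:
            return self.discrete_action
        raise ValueError(f"turn has no {interface!r} control")


@dataclass
class WorldCase:
    setting: str                           # the world setting for the session
    turns: List[Turn] = field(default_factory=list)


# Made-up example case, not taken from the WBench data.
case = WorldCase(
    setting="A snowy mountain village, third-person view",
    turns=[
        Turn(InteractionType.NAVIGATION, "walk forward",
             pose=(0.0, 0.0, 1.0, 0.0, 0.0, 0.0), discrete_action="forward"),
        Turn(InteractionType.EVENT_EDITING, "it starts to snow heavily"),
    ],
)
```

With this shape, a text-conditioned model would call `control_for("text")` on every turn, while a pose-conditioned model would call `control_for("pose")` on navigation turns and fall back to text elsewhere.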
Scoring uses 22 automatic sub-metrics. The pipeline combines specialist vision models with large multimodal models. Critically, all 22 metrics are validated against human judgments, so the automatic scores are not just proxies that look good on paper.
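One common way to check an automatic metric against human judgments is rank correlation: score a set of outputs with the metric, collect human ratings for the same outputs, and see whether the two orderings agree. The summary here does not say which statistic WBench uses, so the sketch below assumes Spearman's rank correlation purely as an illustration, and the sample numbers are invented.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented numbers: one automatic sub-metric vs. mean human rating per clip.
auto_scores = [0.2, 0.5, 0.4, 0.9]
human_ratings = [1.0, 3.0, 2.0, 5.0]
rho = spearman(auto_scores, human_ratings)
```

A value of `rho` near 1 means the metric orders outputs the way human raters do; a value near 0 means the metric is not tracking what people see.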
The results across 20 state-of-the-art models are clear and a little humbling: no single model performs strongly across all five dimensions. Every model tested shows characteristic strengths in some areas and meaningful weaknesses in others. The benchmark surfaces those tradeoffs explicitly, giving diagnostic insight rather than a single leaderboard number that obscures what actually broke.
For product engineers building on top of world models, that matters immediately. If your application is navigation-heavy, the model that wins on overall video quality may not be the right choice. If you need strong physics compliance for a simulation use case, a model that excels at setting adherence might still fail you. WBench gives you a structured way to match model selection to your specific requirements.
Code and data are publicly available at the link above. The practical move today: run your candidate world models through WBench before committing to an integration, and weight the five dimensions by what your product actually requires. Do not treat any single model as a general-purpose solution until you have seen its WBench profile across all five axes.
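Weighting the five dimensions can be as simple as a normalized weighted average over each model's per-dimension scores. The sketch below assumes scores are already on a common 0-to-1 scale where higher is better, which may not match how WBench reports results. The model names, scores, and weights are invented for illustration, and `weighted_score` and `rank_models` are hypothetical helpers, not part of the WBench code.

```python
from typing import Dict, List

DIMENSIONS = (
    "video_quality",
    "setting_adherence",
    "interaction_adherence",
    "consistency",
    "physics_compliance",
)


def weighted_score(profile: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of a model's per-dimension scores."""
    missing = [d for d in DIMENSIONS if d not in profile or d not in weights]
    if missing:
        raise KeyError(f"missing dimensions: {missing}")
    total = sum(weights[d] for d in DIMENSIONS)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(profile[d] * weights[d] for d in DIMENSIONS) / total


def rank_models(profiles: Dict[str, Dict[str, float]],
                weights: Dict[str, float]) -> List[str]:
    """Model names sorted from best to worst under the given weights."""
    return sorted(profiles,
                  key=lambda name: weighted_score(profiles[name], weights),
                  reverse=True)


# Invented profiles: model_a looks better, model_b behaves better physically.
profiles = {
    "model_a": {"video_quality": 0.9, "setting_adherence": 0.9,
                "interaction_adherence": 0.6, "consistency": 0.8,
                "physics_compliance": 0.4},
    "model_b": {"video_quality": 0.6, "setting_adherence": 0.6,
                "interaction_adherence": 0.8, "consistency": 0.7,
                "physics_compliance": 0.8},
}

uniform = {d: 1.0 for d in DIMENSIONS}
# A simulation product that cares most about physics, then interaction.
physics_heavy = {"video_quality": 1.0, "setting_adherence": 1.0,
                 "interaction_adherence": 2.0, "consistency": 1.0,
                 "physics_compliance": 5.0}

best_uniform = rank_models(profiles, uniform)[0]
best_physics = rank_models(profiles, physics_heavy)[0]
```

In this made-up example the uniform weighting favors `model_a` while the physics-heavy weighting favors `model_b`, which is exactly the kind of flip a single leaderboard number would hide.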