OpenCompass 0.5.2 Adds 14 Benchmarks and Broader Model Coverage

OpenCompass 0.5.2 lands with a broad sweep of new evaluation coverage. The headline is benchmark volume: 14 new benchmarks added in a single release, touching scientific reasoning, biology, chemistry, math competition problems, code, and instruction-following.

On the science and domain-specific side, the release adds SciReasoner, Biology Instructions, Mol Instructions, and CMPhysBench. For math competition coverage, HMMT2025, AMO-Bench, and IMO-Bench are now supported. Code evaluation gets LCB-pro and a fix to a buffer-related error in LiveCodeBench. Instruction-following evaluation gains IFBench and PI-LLM. Rounding out the list: ATLAS, OpenSWI, ARC_AGI_2, and ProcessBench.

That breadth matters if you are assembling evaluation suites for specialized models. Domain fine-tunes in biology or chemistry previously had few ready-made configs in open evaluation frameworks. Now you can drop these benchmarks into an existing OpenCompass pipeline without writing configs from scratch.

Model support also grows. Intern-S1-Pro evaluation examples are included, and TeleChat API inference is now supported. If your team is evaluating either of these, the integration work is done for you.

Two infrastructure additions are worth noting for teams running evaluations at scale. Multi-dimensional metric monitoring now covers output length, logprobs, and finish reasons. That gives you signals beyond accuracy scores, useful when you care about verbosity or whether completions are hitting token limits. The parametrized timeout in OpenAISDK is also a quiet quality-of-life improvement: you can now tune timeout behavior without patching library code.

On the bug fix side, OpenAISDK streaming received multiple fixes covering output completeness. If you have seen truncated outputs in streaming evaluations, this release addresses that directly. A pattern match fix in Smolinstruct and the LiveCodeBench buffer error round out the reliability improvements.

CI also got attention. New dataset CI jobs were added, a uni-test was introduced, and the GitHub runner was updated. For contributors and teams running OpenCompass in CI pipelines, this reduces the chance of regressions slipping through.

Four new contributors shipped code in this release, including the PI-LLM support and TeleChat API integration.

What to do now: If you are evaluating models with domain-specific capability requirements, pull 0.5.2 and run the new science and instruction-following benchmarks against your current model versions. The output length and logprobs monitoring is also worth enabling immediately, since those signals are cheap to collect and often reveal behavior that accuracy metrics miss.