Most LLM benchmarks test short prompts or single-step QA; χ-Bench forces agents to manage multi-step, policy-heavy healthcare cases where tool use, long-horizon planning, and document-grounded reasoning matter. The dataset exposes gaps between agent capabilities and real operational workflows — even the best agents in the paper achieve modest pass rates on these end-to-end tasks.
What Sets It Apart
- Realistic, long-horizon workflows: 75 task fixtures across three domains (provider prior authorization, payer utilization management, care management), plus "marathon" sessions that chain 25 tasks into one continuous run. Marathons stress continuity, state tracking, and recovery over many interactions.
- Rich tool / environment surface: tasks run against a high-fidelity simulator of ~20 healthcare apps accessible via MCP or CLI, exercising tool use, multi-step artifact authoring, and inter-agent handoffs in an end-to-end provider↔payer arena.
- Hybrid scoring: deterministic verifiers check what can be checked mechanically, and an LLM "workspace judge" handles the rubric items that cannot, enabling nuanced scoring of authored artifacts and policy alignment.
- Reproducibility & leaderboard: fixtures, shared worlds, and Croissant metadata are packaged for reproducible experiments; the source repo provides the runner, judge harness, and leaderboard submission flow. Note: the Managed-Care Operations Handbook skill is gated and is not redistributed with the Hugging Face dataset.
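The hybrid scoring idea can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: `RubricItem`, `score_task`, `stub_judge`, the workspace fields, and the "every item must pass" rule are all assumptions made for the example, and the judge here is a keyword match standing in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RubricItem:
    name: str
    # Deterministic check over the final workspace; None if the item needs judgment.
    check: Optional[Callable[[dict], bool]] = None


def score_task(workspace, rubric, judge):
    """Score each rubric item, preferring the deterministic check when one exists."""
    results = {}
    for item in rubric:
        if item.check is not None:
            results[item.name] = item.check(workspace)
        else:
            # Fall back to the judge only for items with no deterministic check.
            results[item.name] = judge(item.name, workspace)
    # Assumption: a task passes only if every rubric item passes.
    return {"items": results, "passed": all(results.values())}


def stub_judge(item_name, workspace):
    """Stand-in for an LLM judge call: a keyword match, for illustration only."""
    return "medical necessity" in workspace.get("letter", "").lower()


# Hypothetical workspace state and rubric for a prior-authorization task.
workspace = {"auth_status": "submitted", "letter": "Medical necessity is documented."}
rubric = [
    RubricItem("status_is_submitted", check=lambda ws: ws["auth_status"] == "submitted"),
    RubricItem("letter_addresses_policy"),  # no deterministic check, goes to the judge
]
report = score_task(workspace, rubric, stub_judge)
print(report)
```

Routing only the non-checkable items to the judge keeps LLM calls (and their variance) confined to the part of the rubric that needs them.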
Who it's for — and tradeoffs
Great fit if you need a benchmark that highlights practical limitations of current agent architectures in operational healthcare settings (tool coordination, long context, policy compliance). Use it to stress-test agent memory, tool invocation policies, verifier integration, and multi-agent workflows. Look elsewhere if you only need short-form NLP tasks, lightweight unit tests, or general-purpose text datasets — χ-Bench is intentionally heavyweight and assumes access to containerized runners, Docker, and (for leaderboard reproduction) the gated handbook.
Practical signals
- Empirical headroom is large: the paper reports a best-agent pass@1 of about 28.0%, about 3.8% overall on marathon sessions, and 0% for top agents in the end-to-end provider–payer arena. These numbers show where research effort is most needed: robust tool use, multi-step planning, and domain-grounded retrieval.
- Experiment knobs for research: built-in skill-ablation (controlled handbook removal) and MCP-vs-CLI tool-surface ablations let you probe dependence on reference docs and tool APIs.
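As a rough sketch of how results from these ablations might be tabulated, the snippet below computes pass@1 (the fraction of single-attempt runs that pass) per ablation cell. The record fields `tool_surface`, `handbook`, and `passed` and the toy records themselves are hypothetical and do not reflect the runner's real output format or any reported results.

```python
from collections import defaultdict


def pass_at_1(outcomes):
    """Fraction of single-attempt runs that passed; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def ablation_table(records):
    """Group run records by (tool surface, handbook available) and compute pass@1."""
    cells = defaultdict(list)
    for r in records:
        cells[(r["tool_surface"], r["handbook"])].append(r["passed"])
    return {cell: pass_at_1(outcomes) for cell, outcomes in cells.items()}


# Toy records for illustration only; not real benchmark results.
records = [
    {"tool_surface": "mcp", "handbook": True, "passed": True},
    {"tool_surface": "mcp", "handbook": True, "passed": False},
    {"tool_surface": "cli", "handbook": False, "passed": False},
    {"tool_surface": "cli", "handbook": False, "passed": False},
]
table = ablation_table(records)
for (surface, handbook), rate in sorted(table.items()):
    print(f"{surface:>3} handbook={handbook!s:<5} pass@1={rate:.2f}")
```

Comparing cells that differ in only one knob (handbook on vs. off, or MCP vs. CLI) isolates how much an agent depends on the reference document or on a particular tool surface.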
Where it fits
χ-Bench sits between synthetic agent benchmarks and real-world deployments: it's designed to be more operationally realistic than short benchmarks but still reproducible for research. Treat it as a stress-test for agent architectures intended for regulated, policy-driven domains rather than as a lightweight training corpus.
