Most existing medical-agent benchmarks judge only final outputs, obscuring where autonomous agents actually fail during multi-step research workflows. AutoMedBench reframes evaluation around a five-stage pipeline (Plan, Setup, Validate, Inference, Submit) and scores both final task performance and stage-level behavior, enabling diagnosis of failure modes rather than just outcome ranking.
Key Findings
- Multi-track, long-horizon tasks: five research tracks (segmentation, image enhancement, visual question answering (VQA), report generation, lesion detection), evaluated over thousands of recorded runs averaging ~33 agent turns each, under two scaffolding tiers (Lite and Standard). The design emphasizes persistent planning, tool use, and iterative verification over single-step answers.
- Stage-level scoring: each run receives S1–S5 scores (one per pipeline stage) plus a final task metric, letting evaluators see which workflow phase drives success or failure. This decomposition is the benchmark’s core novelty and the source of its diagnostic value.
- Empirical insights: Validate is the weakest stage on average while Setup is the strongest, indicating agents can assemble executables but struggle to verify reliability. Verification and submission failures dominate tagged errors (37.7% and 38.1% respectively); task-understanding errors are rare (0.9%). Runs with a single fired error code score ~48% lower than runs with no errors.
- Two-tier design: Lite and Standard tiers use the same data/metrics but differ in task-brief scaffolding, enabling analysis of agent robustness to brief quality and guidance.
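To make the stage-level decomposition concrete, here is a minimal sketch of how per-run records could be aggregated to find the weakest stage and to compare runs with and without tagged errors. The `RunRecord` fields, the `[0, 1]` score range, the error-code string, and the relative-drop formula are all assumptions made for illustration; they are not AutoMedBench's actual schema or scoring code, and the numbers are invented toy values.

```python
from dataclasses import dataclass
from statistics import mean

# Stage names follow the five-stage pipeline described above.
STAGES = ("Plan", "Setup", "Validate", "Inference", "Submit")


@dataclass
class RunRecord:
    """Hypothetical record for one agent run (field names are assumptions)."""
    tier: str                 # "Lite" or "Standard"
    stage_scores: tuple       # S1..S5, assumed here to lie in [0, 1]
    task_metric: float        # final task performance
    error_codes: tuple = ()   # tagged error codes; empty if none fired


def mean_stage_scores(runs):
    """Average each stage score across runs, keyed by stage name."""
    return {
        stage: mean(r.stage_scores[i] for r in runs)
        for i, stage in enumerate(STAGES)
    }


def weakest_stage(runs):
    """Return the stage with the lowest mean score."""
    means = mean_stage_scores(runs)
    return min(means, key=means.get)


def relative_drop_with_errors(runs):
    """Relative drop in mean task metric for runs with any tagged error.

    One plausible reading of "scores X% lower"; the benchmark's exact
    definition is not given in this summary.
    """
    clean = mean(r.task_metric for r in runs if not r.error_codes)
    flagged = mean(r.task_metric for r in runs if r.error_codes)
    return 1 - flagged / clean


# Toy values, invented purely for illustration.
runs = [
    RunRecord("Lite", (0.8, 0.9, 0.4, 0.7, 0.6), 0.55, ("E_VERIFY",)),
    RunRecord("Standard", (0.9, 1.0, 0.6, 0.8, 0.7), 0.80),
]

print(weakest_stage(runs))
print(round(relative_drop_with_errors(runs), 4))
```

Keeping the stage scores alongside the final metric in a single record is what allows the same set of runs to be ranked by outcome and also inspected by workflow phase.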
Who it's for and trade-offs
A good fit if you build or evaluate autonomous agents for medical imaging or multimodal inference and need diagnostic metrics that highlight pipeline weaknesses (verification, submission). It is less suitable as a proxy for clinical utility or regulatory approval: AutoMedBench focuses on agent research workflows and synthetic or benchmark-based evaluation, not human-in-the-loop clinical trials. Its scope also centers on imaging and multimodal tasks; non-imaging clinical workflows (e.g., EHR-centric pipelines) are outside the benchmark’s coverage.
Where it fits
AutoMedBench complements single-output benchmarks by providing a workflow-aware layer: use it when you need to compare agent designs on end-to-end research behavior, track which pipeline stages improve with interventions, or prioritize engineering effort (e.g., improving validator modules versus pipeline orchestration).
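As a rough illustration of tracking which stages improve with an intervention, the sketch below compares per-stage mean scores between a baseline agent and a modified one. The function names and the score rows are hypothetical and the values are made up; this is not the benchmark's own tooling.

```python
from statistics import mean

STAGES = ("Plan", "Setup", "Validate", "Inference", "Submit")


def stage_means(score_rows):
    """Column-wise mean over rows of five stage scores (S1..S5)."""
    return [mean(col) for col in zip(*score_rows)]


def stage_deltas(baseline_rows, treated_rows):
    """Per-stage change in mean score: treated minus baseline."""
    base = stage_means(baseline_rows)
    treated = stage_means(treated_rows)
    return {s: t - b for s, b, t in zip(STAGES, base, treated)}


# Invented stage scores for two runs per agent variant.
baseline = [(0.8, 0.9, 0.4, 0.7, 0.6), (0.8, 0.9, 0.5, 0.7, 0.6)]
treated = [(0.8, 0.9, 0.7, 0.7, 0.6), (0.8, 0.9, 0.6, 0.7, 0.7)]

deltas = stage_deltas(baseline, treated)
most_improved = max(deltas, key=deltas.get)
print(most_improved)
```

A per-stage delta of this kind shows whether an intervention (for example, a stronger validator module) moved the stage it targeted or only shifted the final metric.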
