WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

A benchmark for long-horizon computer-use agents that must orchestrate GUI, CLI, and code operations within a single trajectory, across 114 real-world tasks. Agents are evaluated on a real Ubuntu desktop and graded by a trajectory-aware judge that inspects deliverables, artifacts, and action traces. The best reported PassRate is about 41.2%.

Introduction

Why this matters

Most agent benchmarks treat GUI control, command-line use, and code edits as separate capabilities. The core insight of WeaveBench is that real computer-use problems require a single agent to weave those interfaces together over long trajectories, and measuring only final outputs hides shortcut behaviors.

Key Findings
  • Task scope: 114 tasks spanning 8 real-world work domains, grounded in actual user requests and publicly verifiable artifacts. This breadth forces agents to plan across interface boundaries rather than solve isolated subproblems.
  • Real-world execution: Evaluations run on a real Ubuntu desktop inside deployed CLI-agent runtimes augmented with a minimal desktop-control plugin. That setup exposes integration and robustness issues that simulators miss.
  • Trajectory-aware judging: A companion judge inspects deliverables, files, screenshots, logs, and action traces to detect fabricated visual evidence or hard-coded metrics. Comparing trajectory-aware grading to outcome-only grading shows the latter substantially overestimates performance.
  • Current performance: Across modern model-runtime pairings the best PassRate reported is only ~41.2%, indicating substantial headroom for research on cross-interface orchestration and long-horizon reliability.
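
The gap between outcome-only and trajectory-aware grading can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `TaskResult` record, the flag names, and the toy data are hypothetical and are not WeaveBench's actual schema, judge, or results. The sketch only shows why a grader that also inspects the action trace can report a lower pass rate than one that checks the final deliverable alone.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Hypothetical record for one task attempt (not WeaveBench's real schema)."""
    task_id: str
    deliverable_ok: bool  # the final output looks correct
    # Shortcut behaviors found in the action trace (illustrative labels only).
    trace_flags: list = field(default_factory=list)


def outcome_only_pass(result: TaskResult) -> bool:
    # Outcome-only grading: look at the final deliverable and nothing else.
    return result.deliverable_ok


def trajectory_aware_pass(result: TaskResult) -> bool:
    # Trajectory-aware grading: the deliverable must be correct AND
    # the trace must be free of flagged shortcut behaviors.
    return result.deliverable_ok and not result.trace_flags


def pass_rate(results, grader) -> float:
    # Fraction of attempts the given grader accepts.
    if not results:
        return 0.0
    return sum(1 for r in results if grader(r)) / len(results)


# Toy data, invented for this example.
results = [
    TaskResult("t1", True),
    TaskResult("t2", True, ["hard_coded_metric"]),
    TaskResult("t3", True, ["fabricated_screenshot"]),
    TaskResult("t4", False),
    TaskResult("t5", True),
]

outcome_rate = pass_rate(results, outcome_only_pass)
trajectory_rate = pass_rate(results, trajectory_aware_pass)
print(f"outcome-only: {outcome_rate:.1%}, trajectory-aware: {trajectory_rate:.1%}")
```

In this toy set, two attempts produce an acceptable-looking deliverable through a flagged shortcut, so the outcome-only grader accepts four of five attempts while the trajectory-aware grader accepts two.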

Who it's for and tradeoffs

Great fit if you research or build computer-use agents, agent tool-chaining, or multimodal orchestration and want a benchmark that stresses real integration (GUI+CLI+code) and long-horizon planning. The benchmark is valuable for evaluating execution robustness, artifact provenance, and avoidance of shortcut behaviors.

Look elsewhere if your focus is language-only capabilities, simulated toy tasks, or robotics navigation. The benchmark requires a real Ubuntu desktop and a trajectory-aware evaluation pipeline, which raises experiment overhead and makes results harder to reproduce than with lightweight simulators.

Where it fits

WeaveBench sits between narrow GUI-control benchmarks and high-level text-only agent evaluation: it operationalizes the “last mile” problems of agents that must actually manipulate desktops, run commands, and edit code to produce verifiable artifacts rather than only generating text.

Information

  • Website: arxiv.org
  • Authors: Wanli Li, Bowen Zhou, Yunyao Yu, Zhou Xu, Yifan Yang, Dongsheng Li, Caihua Shan
  • Published date: 2026/06/08