AIAny - WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

Introduction

Why this matters

Most agent benchmarks treat GUI control, command-line use, and code edits as separate capabilities. The core insight of WeaveBench is that real computer-use problems require a single agent to weave those interfaces together over long trajectories, and measuring only final outputs hides shortcut behaviors.

Key Findings

Task scope: 114 tasks spanning 8 real-world work domains, grounded in actual user requests and publicly verifiable artifacts. This breadth forces agents to plan across interface boundaries rather than solve isolated subproblems.
Real-world execution: Evaluations run on a real Ubuntu desktop inside deployed CLI-agent runtimes augmented with a minimal desktop-control plugin. That setup exposes integration and robustness issues that simulators miss.
Trajectory-aware judging: A companion judge inspects deliverables, files, screenshots, logs, and action traces to detect fabricated visual evidence or hard-coded metrics. Comparing trajectory-aware grading to outcome-only grading shows the latter substantially overestimates performance.
Current performance: Across modern model-runtime pairings, the best reported PassRate is only about 41.2%, which leaves substantial headroom for research on cross-interface orchestration and long-horizon reliability.
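
To make the gap between outcome-only and trajectory-aware grading concrete, here is a minimal sketch in Python. It is not the benchmark's actual judge: the `TaskResult` record, the violation labels, and the `pass_rate` helper are all hypothetical names chosen for illustration, and the four sample results are made-up toy data. The only assumption carried over from the summary above is that a task should count as passed only when the deliverable is correct and the trajectory shows no shortcut behavior.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskResult:
    """Hypothetical per-task record; not WeaveBench's real schema."""
    task_id: str
    outcome_ok: bool  # the final deliverable looks correct on its own
    # Shortcut flags a judge might raise after inspecting the trajectory
    # (screenshots, logs, action traces). Labels are illustrative only.
    violations: List[str] = field(default_factory=list)


def pass_rate(results: List[TaskResult], trajectory_aware: bool) -> float:
    """Fraction of tasks counted as passed under the chosen grading mode."""
    if not results:
        return 0.0
    passed = sum(
        1
        for r in results
        # Outcome-only grading ignores violations; trajectory-aware
        # grading additionally requires a clean trajectory.
        if r.outcome_ok and (not trajectory_aware or not r.violations)
    )
    return passed / len(results)


# Toy data: three tasks produce a plausible deliverable, but two of them
# got there through a shortcut that only the trajectory reveals.
results = [
    TaskResult("t1", True),
    TaskResult("t2", True, ["fabricated_screenshot"]),
    TaskResult("t3", True, ["hard_coded_metric"]),
    TaskResult("t4", False),
]

outcome_only = pass_rate(results, trajectory_aware=False)
trajectory = pass_rate(results, trajectory_aware=True)
print(f"outcome-only: {outcome_only:.2f}, trajectory-aware: {trajectory:.2f}")
```

On this toy data the outcome-only score is three times the trajectory-aware score, which illustrates the direction of the overestimate described above; the magnitudes say nothing about the benchmark's real numbers.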

Who it's for and tradeoffs

Great fit if you research or build computer-use agents, agent tool-chaining, or multimodal orchestration and want a benchmark that stresses real integration (GUI+CLI+code) and long-horizon planning. The benchmark is valuable for evaluating execution robustness, artifact provenance, and avoidance of shortcut behaviors.

Look elsewhere if your focus is language-only capabilities, simulated toy tasks, or robotics navigation. The benchmark requires a real Ubuntu desktop setup and a trajectory-aware evaluation pipeline, which raises experiment overhead and makes results harder to reproduce than with lightweight simulators.

Where it fits

WeaveBench sits between narrow GUI-control benchmarks and high-level text-only agent evaluation: it operationalizes the “last mile” problems of agents that must actually manipulate desktops, run commands, and edit code to produce verifiable artifacts rather than only generating text.
