A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Proposes TASTE, an automatic pipeline that synthesizes challenging agent benchmark tasks by sampling and evolving valid tool-sequence patterns; uses an adaptive contrastive n-gram model and LLM validity judgments to produce τ^c-Bench with broader tool-use coverage and higher difficulty.

Introduction

Agent benchmarks tend to saturate as model capabilities rise, and high scores can reflect narrow coverage rather than robust tool use. This paper flips the usual workflow: instead of writing scenarios in natural language and mapping them to tools, it first samples valid tool sequences, evolves them, and only then generates tasks from them, so that agents must combine tools in diverse, hard-to-solve ways. The core insight is that sampling valid tool sequences first exposes a much larger space of realistic tool-usage patterns and, combined with iterative selection and difficulty evolution, produces benchmarks that better reveal agent weaknesses.

Key Findings
  • TASTE (Task Synthesis from Tool Sequence Evolution) uses an Adaptive Contrastive n-gram model trained on LLM-judged validity signals to sample plausible tool sequences; representative sequences are chosen via clustering and instantiated into full tasks. This pipeline substantially increases both coverage and difficulty.
  • Using TASTE, the authors construct τ^c-Bench, an extension of three domains from τ^2-Bench that more than doubles the number of unique tool combinations agents must execute.
  • Empirical evaluation over 11 agent/user LLM pairs shows dramatic performance drops for models that nearly saturated τ^2-Bench; e.g., Gemini-3-Flash falls from ~0.82–0.94 to ~0.28–0.61 on the new tasks, indicating prior benchmarks had become insufficiently discriminative.
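
To make the sampling idea in the first finding concrete, here is a minimal sketch of one plausible reading of a contrastive n-gram sampler. It is not the authors' implementation. It assumes a bigram model, assumes the LLM judge has already labeled some sequences valid or invalid, and weights each transition by the ratio of smoothed counts in the two sets. The tool names, the function names, and the smoothing constant `alpha` are all hypothetical.

```python
import random
from collections import Counter

START, END = "<s>", "</s>"


def bigram_counts(sequences):
    """Count bigrams, including start/end markers, over tool sequences."""
    counts = Counter()
    for seq in sequences:
        padded = [START, *seq, END]
        counts.update(zip(padded, padded[1:]))
    return counts


def contrastive_weights(valid, invalid, tools, alpha=1.0):
    """Weight each transition by smoothed valid count / smoothed invalid count.

    Assumption: this ratio stands in for the paper's contrastive scoring,
    whose exact form is not given in this summary.
    """
    pos, neg = bigram_counts(valid), bigram_counts(invalid)
    weights = {}
    for prev in [START, *tools]:
        for nxt in [*tools, END]:
            key = (prev, nxt)
            weights[key] = (pos[key] + alpha) / (neg[key] + alpha)
    return weights


def sample_sequence(weights, tools, rng, max_len=6):
    """Sample one non-empty tool sequence of at most max_len tools."""
    seq, prev = [], START
    while len(seq) < max_len:
        # Do not allow the sequence to end before it contains any tool.
        candidates = [*tools, END] if seq else list(tools)
        probs = [weights[(prev, c)] for c in candidates]
        nxt = rng.choices(candidates, weights=probs, k=1)[0]
        if nxt == END:
            break
        seq.append(nxt)
        prev = nxt
    return seq


# Hypothetical tools and judge-labeled sequences, for illustration only.
tools = ["get_user", "get_order", "cancel_order", "refund"]
valid = [["get_user", "get_order", "cancel_order"],
         ["get_user", "get_order", "refund"]]
invalid = [["cancel_order", "get_user"], ["refund", "get_order"]]

weights = contrastive_weights(valid, invalid, tools)
print(sample_sequence(weights, tools, random.Random(0)))
```

In this sketch, the "adaptive" part would amount to recomputing `weights` as new validity judgments arrive, so that later samples drift toward transitions the judge accepts.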

Who it's for and tradeoffs

Great fit if you need continuous, scalable benchmarks that stress tool composition and expose brittle tool-use behavior in agent stacks. It is useful for benchmark designers, research labs evaluating agent generalization, and teams tuning tool-integration policies. Look elsewhere if you require human-authored, semantically nuanced real-world scenarios (TASTE automates instance generation and hinges on LLM validity judgments), or if resource constraints prevent running large-scale sampling and iterative evolution pipelines.

Where it fits

TASTE complements human-written benchmarks: it is not a replacement for curated, domain-expert tasks but is a practical way to expand coverage and avoid early saturation. Consider using τ^c-Bench alongside existing human-authored suites to get a more balanced picture of agent capabilities.

Methodological notes

The approach depends on LLMs to judge sequence validity and on clustering to pick representative sequences before instantiation and difficulty evolution. This makes the pipeline adaptable to new tool sets, but it also couples benchmark behavior to the LLMs used for validity signals. That design choice enables scale, and it can also carry the judge models' biases into the resulting benchmark.
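
The representative-selection step can be illustrated with a small sketch. The summary does not say which clustering method the paper uses, so the code below swaps in greedy farthest-point selection over Jaccard distance between tool sets. It also counts coverage by treating a "tool combination" as an unordered set of tools, which is an assumption. All function names are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the sets of tools in two sequences."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def pick_representatives(sequences, k):
    """Greedy farthest-point selection of up to k mutually distant sequences.

    Starts from the first sequence, then repeatedly adds the sequence that
    is farthest from everything chosen so far.
    """
    if not sequences or k <= 0:
        return []
    chosen = [0]
    # Distance from each sequence to its nearest chosen representative.
    min_dist = [jaccard_distance(sequences[0], s) for s in sequences]
    while len(chosen) < min(k, len(sequences)):
        idx = max(range(len(sequences)), key=lambda i: min_dist[i])
        if min_dist[idx] == 0.0:
            break  # every remaining sequence uses an already-covered tool set
        chosen.append(idx)
        for i, s in enumerate(sequences):
            min_dist[i] = min(min_dist[i], jaccard_distance(sequences[idx], s))
    return [sequences[i] for i in chosen]


def unique_tool_combinations(sequences):
    """Distinct unordered tool sets, a simple proxy for coverage."""
    return {frozenset(s) for s in sequences}


# Hypothetical sampled sequences, for illustration only.
sampled = [["a", "b"], ["a", "b"], ["c", "d"], ["a", "c"]]
reps = pick_representatives(sampled, k=2)
print(reps, len(unique_tool_combinations(sampled)))
```

Farthest-point selection is used here only because it is short and dependency-free. A real pipeline could use any clustering method and keep one medoid per cluster.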

Information

  • Website: arxiv.org
  • Authors: Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz, Michal Shmueli-Scheuer, Roi Reichert
  • Published date: 2026/05/27