Proposes TASTE, an automatic pipeline that synthesizes challenging agent benchmark tasks by sampling and evolving valid tool-sequence patterns. It combines an adaptive contrastive n-gram model with LLM validity judgments to produce τ^c-Bench, a benchmark with broader tool-use coverage and higher difficulty.
A benchmark for evaluating web-browsing agents in Korean contexts, composed of 400 tasks (300 manually verified by native speakers). Includes a human-verified split and an adversarial synthetic split to probe failure modes; reveals large performance gaps for both frontier and Korean models.