Agentic evaluations are moving beyond isolated reasoning or instruction following to measure how models act in multi-step, tool-using environments — but most benchmarks are English-centric. K-BrowseComp addresses this gap by grounding web-browsing agent challenges in Korean-language contexts and real-world web tasks, exposing linguistic, cultural, and UI-specific failure modes that English benchmarks miss.
Key findings
- Performance gap: On the 300-problem human-verified subset, top frontier models reach only about 30–46% exact success, while several Korean LLMs score roughly 0–10%. Even the strongest models therefore fail more than half of the tasks that require Korean grounding and end-to-end web interaction.
- Adversarial diagnostic split: A 100-problem synthetic split, built with hard few-shot exemplars and generation targeted at known failure modes, yields even lower peak performance (~26% for the best model). This suggests that generating a hard browsing task can be easier than solving one, or at least hard in a different way.
- Dataset composition matters: The benchmark separates a carefully validated native-speaker subset from a synthetic adversarial split, allowing evaluators to distinguish realistic capability gaps from targeted stress tests.
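To make the per-split success figures concrete, here is a minimal sketch of how exact success could be tallied separately for each split. The record fields (`split`, `gold`, `prediction`) and the normalization rules are assumptions for illustration; the benchmark's released grader may score answers differently.

```python
import unicodedata
from collections import defaultdict


def normalize(answer: str) -> str:
    """NFC-normalize (so composed and decomposed Hangul compare equal),
    collapse whitespace, and casefold. These rules are an assumption."""
    text = unicodedata.normalize("NFC", answer)
    return " ".join(text.split()).casefold()


def success_by_split(records):
    """Return {split: fraction of records whose prediction matches gold}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["split"]] += 1
        if normalize(r["prediction"]) == normalize(r["gold"]):
            hits[r["split"]] += 1
    return {split: hits[split] / totals[split] for split in totals}


# Hypothetical toy records, not taken from the benchmark.
records = [
    {"split": "verified", "gold": "서울특별시", "prediction": " 서울특별시 "},
    {"split": "verified", "gold": "1997년", "prediction": "1998년"},
    {"split": "synthetic", "gold": "한강", "prediction": "낙동강"},
]

print(success_by_split(records))  # → {'verified': 0.5, 'synthetic': 0.0}
```

Reporting the two splits separately, rather than as one pooled number, keeps the realistic capability gap and the targeted stress test from masking each other.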
Who it’s for and trade-offs
Great fit if you evaluate or develop web-browsing agents, multilingual agent skills, or Korean-language LLMs and need realistic, localized tasks that include web UIs, Korean text understanding, and multi-step tool use. Look elsewhere if your focus is purely English benchmarks, single-turn NLP evaluation, or low-cost tests at scale. K-BrowseComp’s human-verified split emphasizes fidelity over scale, and its adversarial split is explicitly designed to stress models rather than reflect typical end-user distributions.
Methodological notes
The authors release both the data and the code. The verified subset was manually constructed and validated by native Korean speakers to ensure cultural and linguistic relevance. The synthetic split uses adversarial filtering and hard exemplar design to target specific failure modes, which makes it a directed stress test that complements the realism of the human-verified tasks.
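The summary does not spell out how the adversarial filtering is carried out. As a generic illustration of the idea only, the sketch below keeps a generated problem when no more than `max_solved` reference solvers answer it correctly. The `Solver` type, the `question` and `gold` fields, and the toy solver are all hypothetical and are not the authors' pipeline.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a solver maps a question string to an answer string.
Solver = Callable[[str], str]


def adversarial_filter(
    candidates: List[Dict[str, str]],
    solvers: List[Solver],
    max_solved: int = 0,
) -> List[Dict[str, str]]:
    """Keep candidates that at most `max_solved` solvers answer correctly."""
    kept = []
    for item in candidates:
        gold = item["gold"].strip()
        solved = sum(
            1 for solve in solvers if solve(item["question"]).strip() == gold
        )
        if solved <= max_solved:
            kept.append(item)
    return kept


def lookup_solver(question: str) -> str:
    """Toy stand-in for a model: answers only one memorized question."""
    known = {"대한민국의 수도는?": "서울"}
    return known.get(question, "")


# Hypothetical candidate problems, not taken from the benchmark.
candidates = [
    {"question": "대한민국의 수도는?", "gold": "서울"},
    {"question": "가상의 어려운 질문", "gold": "정답"},
]

hard = adversarial_filter(candidates, [lookup_solver])
print([c["question"] for c in hard])  # → ['가상의 어려운 질문']
```

Because a filter like this selects for items the reference solvers fail, scores on the resulting split are best read as a stress-test signal rather than an estimate of typical task difficulty.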
Overall insight: K-BrowseComp shows that agentic browsing competence does not transfer cleanly across languages and locales, so benchmarking in localized contexts is essential for real-world agent robustness.
