A benchmark for evaluating web-browsing agents in Korean contexts, comprising 400 tasks, 300 of which were manually verified by native speakers. It includes a human-verified split and an adversarial synthetic split designed to probe failure modes, and it reveals large performance gaps for both frontier and Korean models.