Web interaction is one of the hardest places to train grounded agents because pages are heterogeneous, dynamic, and full of UI noise. WebWorldData's core insight is to favor structured accessibility (A11y) trees and multi-turn trajectories over raw HTML, so models learn stateful UI dynamics at scale rather than brittle DOM string patterns.
What Sets It Apart
- Structured state representation: states are given as A11y trees (plus HTML/XML/Markdown variants), which normalizes layout and focusable elements and better aligns model inputs with what assistive tech and interaction policies need. This reduces surface-level noise compared with raw DOM snapshots.
- Trajectory-first format: samples are multi-turn (state, action, next_state) trajectories of up to 30 steps with long contexts (reported at roughly 30K tokens at the upper end). That format directly supports world-model objectives (predict the next state given the current state and an action) rather than single-step supervised labels.
- Collection diversity and scale: the dataset aggregates ~1.06M trajectories from hundreds of thousands of real URLs across domains (tech, e‑commerce, news, education, etc.), and includes both English and Chinese content—useful for multilingual agent training and robustness testing.
- Mixed provenance and filtering: data sources include randomized crawls, LLM-driven autonomous exploration, synthetic task generation, and open-source agent reformatting. Dual-stage filtering (rule-based + LLM URL/content scoring) reduces unsafe content but does not eliminate residual PII or non-deterministic page elements.
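To make the trajectory-first format concrete, here is a minimal sketch of how one (state, action, next_state) trajectory might be unrolled into next-state prediction pairs. The `Step` class, its field names, the action strings, and the prompt layout are all hypothetical illustrations; the dataset's actual schema and serialization may differ, so check the dataset card before relying on any of them.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

MAX_STEPS = 30  # reported upper bound on trajectory length


@dataclass
class Step:
    # Hypothetical field names, chosen for illustration only.
    state: str       # serialized A11y tree before the action
    action: str      # action taken, e.g. "click [1]" (assumed format)
    next_state: str  # serialized A11y tree after the action


def to_world_model_pairs(steps: List[Step]) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, target) pairs for next-state prediction.

    Each prompt holds the history of earlier (state, action) turns plus
    the current state and action; the target is the resulting state.
    """
    history: List[str] = []
    for step in steps[:MAX_STEPS]:
        current = f"STATE:\n{step.state}\nACTION: {step.action}"
        prompt = "\n".join(history + [current, "NEXT_STATE:"])
        yield prompt, step.next_state
        history.append(current)


if __name__ == "__main__":
    demo = [
        Step("button 'Search'", "click [1]", "textbox 'Query'"),
        Step("textbox 'Query'", "type [2] cats", "list 'Results'"),
    ]
    for prompt, target in to_world_model_pairs(demo):
        print(prompt, "->", target)
```

Carrying the full history in each prompt is one design choice among several. With contexts reported at roughly 30K tokens, a real pipeline would likely need to truncate or summarize earlier states.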
Who It's For and Trade-offs
Great fit if you want to pretrain or fine-tune models that need to predict UI state changes, simulate multi-step web tasks, or produce synthetic agent trajectories for downstream RL/IL. It’s also useful for research on world models and environment prediction for browser-based agents.
Look elsewhere if you need perfectly reproducible page snapshots (dynamic ads, A/B tests, and client-side changes mean some trajectories are inherently non-deterministic), strictly PII-free corpora (residual PII may remain despite filtering), or very small, curated task suites for controlled benchmarking. Note that the dataset is released under Apache-2.0 and prioritizes breadth and realism over deterministic reproducibility.
