Progress on world models often stalls not for lack of model ideas but because of brittle data pipelines and inconsistent evaluation. This platform addresses that gap with one consistent interface covering the three research stages (data collection, model training, and MPC-based evaluation), so experiments are comparable and reproducible across environments and data formats.
What Sets It Apart
- Unified pipeline for collect→train→evaluate: the same high-level API runs across 30+ environments (Atari, DMControl, Gymnasium variants, custom simulated tasks), which removes integration work so researchers can focus on model and objective changes. This speeds iteration and reduces hidden implementation drift across baselines.
- Format-agnostic dataset registry with LanceDB, HDF5, and video backends: high-throughput readers and conversion utilities let teams store and share episode-contiguous datasets efficiently. The result is reproducible checkpoints and lower I/O overhead for large-scale visual datasets.
- Reference implementations and planners included: out-of-the-box CEM/iCEM/MPPI solvers and published baselines (LeWM, DINO-WM, etc.) let you reproduce prior results and run new comparisons under the same evaluation protocol. This reduces variance from mismatched evaluation wrappers or custom MPC code.
- Emphasis on factors of variation and distribution-shift tests: many environments expose visual and physical factor controls, so zero-shot generalization can be evaluated without extra environment wiring. This simplifies robust generalization benchmarks for visual world models.
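To make the planner bullet concrete, here is a minimal sketch of plain CEM used for MPC on a toy 1-D task, written with only the Python standard library. It illustrates the general technique and is not the platform's solver API: the names `cem_plan`, `mpc_rollout`, `toy_dynamics`, and `toy_cost` are hypothetical, and the toy dynamics stand in for a learned world model.

```python
import random
import statistics


def toy_dynamics(state, action):
    # Stand-in for a learned world model: a 1-D point that moves by `action`.
    return state + action


def toy_cost(state, action):
    # Penalize distance from the origin plus a small action cost.
    return state ** 2 + 0.01 * action ** 2


def cem_plan(dynamics, cost, state, horizon=10, n_samples=64,
             n_elites=8, n_iters=5, seed=0):
    """Return the first action of the action-sequence mean found by CEM."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(n_iters):
        # Sample candidate action sequences from the current Gaussian.
        candidates = [
            [rng.gauss(m, s) for m, s in zip(mean, std)]
            for _ in range(n_samples)
        ]
        # Score each candidate by rolling it out through the model.
        scored = []
        for actions in candidates:
            s, total = state, 0.0
            for a in actions:
                s = dynamics(s, a)
                total += cost(s, a)
            scored.append((total, actions))
        scored.sort(key=lambda pair: pair[0])
        elites = [actions for _, actions in scored[:n_elites]]
        # Refit the sampling distribution to the lowest-cost sequences.
        mean = [statistics.fmean(col) for col in zip(*elites)]
        std = [statistics.pstdev(col) + 1e-6 for col in zip(*elites)]
    return mean[0]


def mpc_rollout(state, steps=20):
    # Receding-horizon control: replan at every step, apply only the first action.
    for _ in range(steps):
        action = cem_plan(toy_dynamics, toy_cost, state)
        state = toy_dynamics(state, action)
    return state
```

Calling `mpc_rollout(3.0)` drives the toy state toward the origin. Variants such as iCEM and MPPI change how candidates are sampled and how the distribution is updated, while the replan-every-step loop stays the same.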
Who It's For + Tradeoffs
Great fit if you are a research group or practitioner who needs repeatable comparisons of world-model architectures and planning strategies, and wants dataset tooling that scales (LanceDB support, dataset converters, CLI). Look elsewhere if you need a turnkey robot stack for real hardware (this is focused on simulated, benchmarked environments) or if you require heavily customized environment APIs that are not compatible with Gymnasium-like interfaces. APIs are actively evolving, so expect minor breaking changes between releases.
Where It Fits
Think of this as the experiment infrastructure layer between model code and evaluation: not a novel model itself but the reproducible scaffolding that makes fair comparisons and dataset sharing practical. It complements model libraries (PyTorch-based research code) and dataset hubs (LanceDB, Hugging Face) by providing standardized collection and MPC evaluation utilities.
How It Works (brief)
The core pieces are an environment registry with factor-of-variation (FoV) controls, a format registry for dataset read/write (lance/hdf5/video), reference model training scripts, and a solver suite for MPC evaluation. Datasets and checkpoints are stored under a configurable $STABLEWM_HOME, and CLI tools let you inspect and convert datasets without writing glue code.
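As a rough illustration of how a configurable storage root like this is typically resolved, the sketch below reads `STABLEWM_HOME` from the environment and derives per-artifact directories. Everything other than the variable name is an assumption made for the example: the fallback path `~/.stable_worldmodel`, the `datasets` and `checkpoints` subdirectory names, and the helpers `resolve_home` and `artifact_dir` are hypothetical and may not match the platform's actual layout.

```python
import os
from pathlib import Path

# Hypothetical fallback used only in this sketch.
DEFAULT_HOME = "~/.stable_worldmodel"

# Hypothetical subdirectory names for the two artifact kinds the text mentions.
ARTIFACT_KINDS = ("datasets", "checkpoints")


def resolve_home(env=None):
    """Return the storage root, preferring STABLEWM_HOME when it is set."""
    env = os.environ if env is None else env
    raw = env.get("STABLEWM_HOME") or DEFAULT_HOME
    return Path(raw).expanduser()


def artifact_dir(kind, env=None):
    """Return the directory for one artifact kind under the storage root."""
    if kind not in ARTIFACT_KINDS:
        raise ValueError(f"unknown artifact kind: {kind!r}")
    return resolve_home(env) / kind
```

Passing an explicit mapping, for example `artifact_dir("datasets", {"STABLEWM_HOME": "/tmp/swm"})`, keeps the lookup testable without touching the real process environment.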
