Progress on video-based physical reasoning often stalls when datasets lack precise, per-object physics labels or replayable scenes. This collection addresses that gap by providing reproducible Isaac Sim scenes with frame-level physics states and multiple visual modalities, so researchers can train and benchmark models that must predict both visual outcomes and the underlying physical quantities.
What sets it apart
- Reproducible, simulation-backed annotations: every clip is tied to a USD scene and a deterministic seed, so runs can be replayed exactly, which is valuable for debugging learned dynamics and for ablation studies. Many real-world video datasets, by contrast, offer only sparse labels.
- Physics-native labels, not proxies: per-object velocity, angular velocity, center-of-mass displacement and cumulative rotation are exported alongside per-frame segmentation and lossless depth, enabling supervised training on physics targets (NPZ) and auxiliary visual objectives (depth, flow, segmentation).
- Multi-scenario coverage tuned for dynamics: ten procedurally parameterized scene families (domino chains, ball mixers, billiards, towers, wrecking-ball, ramps, etc.) emphasize different phenomena (chains, multi-body collisions, constrained dynamics) so models learn transferable physical primitives instead of overfitting a single interaction type.
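As a rough illustration of working with the physics targets, the sketch below reads per-frame arrays from an NPZ archive and derives two simple quantities from them. The key names (`velocity`, `angular_velocity`), the `(frames, objects, 3)` array layout, and the fixed time step `dt` are assumptions made for this example, not the dataset's documented schema; check the actual files before relying on them. The archive here is a tiny in-memory stand-in so the snippet runs without any dataset files.

```python
import io

import numpy as np


def load_physics_targets(npz_file):
    """Read every array in an NPZ archive into a plain dict.

    The key names used later ("velocity", "angular_velocity") are
    hypothetical; the real archives may use different names and shapes.
    """
    with np.load(npz_file) as data:
        return {key: data[key] for key in data.files}


def speed_per_object(velocity):
    """Euclidean speed per frame and object, shape (frames, objects)."""
    return np.linalg.norm(velocity, axis=-1)


def integrate_rotation(angular_velocity, dt):
    """Running total of rotation angle (radians) per object.

    Rectangle-rule integration of angular speed. This is one possible
    way to sanity-check an exported cumulative-rotation label, not
    necessarily how the dataset computes it.
    """
    angular_speed = np.linalg.norm(angular_velocity, axis=-1)
    return np.cumsum(angular_speed * dt, axis=0)


# Stand-in archive: 4 frames, 2 objects (assumed layout).
frames, objects = 4, 2
velocity = np.zeros((frames, objects, 3))
velocity[:, 0, 0] = 3.0  # object 0 moves at (3, 4, 0)
velocity[:, 0, 1] = 4.0
angular_velocity = np.zeros((frames, objects, 3))
angular_velocity[:, 1, 2] = 2.0  # object 1 spins about z at 2 rad/s

buffer = io.BytesIO()
np.savez(buffer, velocity=velocity, angular_velocity=angular_velocity)
buffer.seek(0)

targets = load_physics_targets(buffer)
speeds = speed_per_object(targets["velocity"])
rotation = integrate_rotation(targets["angular_velocity"], dt=0.5)
```

With real files, `buffer` would be replaced by a path to one clip's NPZ archive, and the derived arrays could be compared against the exported labels or used directly as regression targets.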
Who it's for, and tradeoffs
It is a great fit if you are training world models, physics-aware video predictors, or depth/flow/segmentation networks that benefit from dense, deterministic ground truth, or if you want to test simulation-to-real transfer. It is also useful for physics-grounded captioning research, because the captions are tied to actual physics values.
Look elsewhere if your target is purely in-the-wild visual diversity or human-centric scenes: the dataset is fully synthetic (no real imagery), occupies roughly 15 TB across modalities, and requires compute and storage infrastructure to host and preprocess the lossless depth and per-frame PNG masks.
Where it fits
Use this dataset to bootstrap models that learn dynamics priors, or as synthetic pretraining data ahead of fine-tuning on real data. Combine it with real-world datasets for domain-adaptation workflows, and treat it as a physics-rich complement to natural video benchmarks rather than a drop-in replacement for them.
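One simple way to combine synthetic and real clips is to fix the share of real samples in every training batch. The sketch below is a minimal, framework-agnostic version of that idea. The function name `mixed_batches`, the `real_fraction` parameter, and the clip identifiers are all hypothetical and are not part of the dataset or its tooling.

```python
import random


def mixed_batches(synthetic, real, batch_size, real_fraction, num_batches, seed=0):
    """Yield batches with a fixed share of real samples.

    Items are drawn with replacement, so either pool may be smaller
    than the number of items requested from it.
    """
    rng = random.Random(seed)  # seeded so the schedule is repeatable
    n_real = round(batch_size * real_fraction)
    n_synthetic = batch_size - n_real
    for _ in range(num_batches):
        batch = rng.choices(synthetic, k=n_synthetic) + rng.choices(real, k=n_real)
        rng.shuffle(batch)  # avoid a fixed synthetic-then-real ordering
        yield batch


# Hypothetical clip identifiers standing in for real dataset entries.
synthetic_clips = [f"syn_{i}" for i in range(100)]
real_clips = [f"real_{i}" for i in range(10)]

batches = list(
    mixed_batches(synthetic_clips, real_clips, batch_size=8,
                  real_fraction=0.25, num_batches=3)
)
```

Raising `real_fraction` over the course of training is one possible schedule for moving from synthetic pretraining toward real-data fine-tuning; the right ratio depends on the task and is not something the source specifies.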
