Most video datasets force a choice between scale and deterministic geometry: large real-world collections offer diversity but little ground-truth depth or camera metadata, while small synthetic corpora offer exact supervision but limited variety. SDG-SynHuman is built to avoid that tradeoff, pairing structured, deterministic geometry with a corpus large enough for world-model training.
What Sets It Apart
- Paired geometry + photorealism at scale: 236,937 clips (~5,841 hours) rendered at 1080p/30fps with metric depth and per-frame camera intrinsics/extrinsics. That pairing lets models learn geometry-aware representations without costly real-world depth capture.
- Rich, controllable variation: 4,050 digital-human assets, 8,184 animations, ~400 distinct environments (198 indoor + 200 outdoor), and 14 camera-motion scenarios. The dataset intentionally samples camera trajectories and secondary motion layers (breathing, drift, shake, etc.) to evaluate camera-motion generalization and robustness.
- Simulation-first provenance: everything is procedurally generated via NVIDIA’s Synthetic Data Generation pipeline and USD scene graphs, so annotations are deterministic, reproducible, and free of PII or real-subject imagery.
- Designed for world-model workflows: the data is packaged as contiguous clips (60–120 s) with lossless metric depth, per-sample camera JSON, and scene metadata. This supports pretraining, post-training, and ablation studies where control over camera and scene factors matters.
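
To make the depth-plus-camera pairing concrete, here is a minimal sketch of lifting one pixel to a 3D point with the pinhole model. The JSON field names (`fx`, `fy`, `cx`, `cy`, `cam_to_world`), the camera-to-world matrix convention, and the reading of depth as distance along the optical axis are all assumptions for illustration; the dataset's actual schema and axis conventions should be checked against its documentation.

```python
import json

# Hypothetical per-frame camera record. Field names and the camera-to-world
# convention are assumptions, not the dataset's documented schema.
camera_json = """
{
  "fx": 1000.0, "fy": 1000.0, "cx": 960.0, "cy": 540.0,
  "cam_to_world": [[1, 0, 0, 2.0],
                   [0, 1, 0, 0.0],
                   [0, 0, 1, -1.0],
                   [0, 0, 0, 1]]
}
"""


def backproject(u, v, depth_m, cam):
    """Lift pixel (u, v) to camera space with the pinhole model.

    depth_m is assumed to be metric depth along the optical axis (z-depth),
    not distance along the viewing ray.
    """
    x = (u - cam["cx"]) * depth_m / cam["fx"]
    y = (v - cam["cy"]) * depth_m / cam["fy"]
    return (x, y, depth_m)


def to_world(p_cam, cam_to_world):
    """Apply a 4x4 camera-to-world transform to a camera-space point."""
    h = (p_cam[0], p_cam[1], p_cam[2], 1.0)  # homogeneous coordinates
    return tuple(sum(m * c for m, c in zip(row, h)) for row in cam_to_world[:3])


cam = json.loads(camera_json)
point_cam = backproject(1160.0, 540.0, 5.0, cam)
point_world = to_world(point_cam, cam["cam_to_world"])
print(point_cam)    # (1.0, 0.0, 5.0)
print(point_world)  # (3.0, 0.0, 4.0)
```

Because depth is metric and the camera parameters are exact, the recovered point is in real units with no scale ambiguity, which is the property that real-world video without depth capture cannot offer.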
Key Capabilities
- Geometry-supervised video pretraining: paired RGB + metric depth + camera parameters enable training of depth-aware video encoders, self-supervised world models that factor camera motion, and methods that require metric supervision for physical reasoning.
- Camera-motion evaluation and domain gap studies: the explicit primary/secondary motion taxonomy (static, egocentric, tracking, flythrough, arc, zigzag, birdseye, plus breathing/drift/shake, etc.) makes it straightforward to measure a model’s sensitivity to camera behaviors and to create controlled generalization tests.
- Scale for foundation models: with 236,937 temporally coherent clips and ~631M RGB frames, the dataset is sized for pretraining or large-scale fine-tuning of video foundation/world models.
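
The headline numbers can be cross-checked with simple arithmetic. The sketch below uses only figures quoted in this article (clip count, approximate hours, frame rate); since the hour count is itself approximate, the results are approximate too.

```python
CLIPS = 236_937   # clip count quoted above
HOURS = 5_841     # approximate total duration quoted above
FPS = 30          # stated frame rate

total_seconds = HOURS * 3600
total_frames = total_seconds * FPS
mean_clip_seconds = total_seconds / CLIPS

print(f"{total_frames / 1e6:.0f}M frames")            # 631M frames
print(f"{mean_clip_seconds:.1f} s mean clip length")  # 88.7 s mean clip length
```

The frame total matches the ~631M RGB frames figure, and the implied mean clip length of about 89 seconds sits inside the stated 60–120 s clip range.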
Who It’s For — and Tradeoffs
Great fit if you are a researcher or engineer who trains or evaluates video world models, depth-aware video encoders, or physical-AI agents, or who studies camera-control generalization, and you need deterministic geometry and repeatable scene configs. The dataset's structured metadata simplifies synthetic-to-real transfer studies and ablation experiments on camera and scene factors.
Look elsewhere if your target task strictly requires real-world appearance statistics (e.g., subtle sensor noise patterns, real human idiosyncrasies) for final deployment. The dataset is intended as a complementary synthetic corpus, and the authors explicitly recommend validating on representative real data before production use. Also avoid it for biometric identification or real-person surveillance tasks: the dataset is synthetic and not intended for those uses.
Where It Fits
Use SDG-SynHuman as a high-fidelity synthetic supplement during pretraining/post-training or when you need deterministic ground-truth depth and camera metadata at scale. For final validation, pair with real-world benchmarks to measure domain gap and robustness.
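
One way to measure robustness in practice is to report a metric per camera-motion type and compare each type against the static-camera case. The sketch below is a minimal illustration of that bookkeeping: the motion labels come from the taxonomy described earlier, while the error values, the `error_by_motion` and `gap_vs_baseline` helpers, and the choice of `static` as the baseline are hypothetical. The same grouping could be applied to results on a real-world benchmark to compare gaps across domains.

```python
from collections import defaultdict
from statistics import mean

# Placeholder (motion label, per-clip error) pairs. The numbers are made up
# for illustration and are not results from the dataset or any model.
per_clip = [
    ("static", 0.10), ("static", 0.12),
    ("tracking", 0.15), ("tracking", 0.17),
    ("flythrough", 0.30), ("flythrough", 0.26),
]


def error_by_motion(records):
    """Average the per-clip error within each camera-motion label."""
    groups = defaultdict(list)
    for motion, err in records:
        groups[motion].append(err)
    return {motion: mean(errs) for motion, errs in groups.items()}


def gap_vs_baseline(means, baseline="static"):
    """Difference between each motion type's mean error and the baseline's."""
    base = means[baseline]
    return {m: v - base for m, v in means.items() if m != baseline}


means = error_by_motion(per_clip)
gaps = gap_vs_baseline(means)
worst = max(gaps, key=gaps.get)  # motion type with the largest degradation

for motion, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{motion}: +{gap:.3f} over static")
print("most sensitive to:", worst)
```

Reporting the gap rather than the raw error keeps the comparison focused on camera behavior, since scene and asset difficulty affect the static baseline as well.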
