Why This Matters
Egocentric datasets often force a trade-off between scale and geometric fidelity. Stera-10M addresses that trade-off by pairing long continuous sessions (up to 104 minutes) captured on commodity hardware with per-frame depth, 6-DoF camera trajectories, and dense two-hand annotations, a combination of modalities that few large-scale egocentric corpora offer together. This makes it useful when models need long-horizon, geometrically grounded signals rather than short, isolated clips.
What Sets It Apart
- Multimodal continuity: 200 hours across 584 sessions (mean ~20.5 min, longest 104 min) with synchronized RGB, LiDAR depth, IMU, and ARKit 6-DoF poses, so models can train on long temporal context with consistent geometry.
- Dense hand and scene grounding: 21-joint MANO two-hand motion capture plus session-level room meshes, so manipulation, hand-object interaction, and real-to-sim pipelines get anatomically consistent, spatially anchored supervision.
- Hierarchical language supervision: annotations at the session, sub-goal, episode, and atomic-action levels, so vision-language-action (VLA) pretraining and temporal action modeling can draw on multi-scale semantics.
- Commodity-first design: captured end-to-end on iPhone Pro and distributed with the open Stera SDK and exporters, so labs can reproduce the capture setup, extend the corpus, or collect compatible data with off-the-shelf hardware.
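The four annotation levels listed above nest naturally as time spans. The following is a minimal sketch of that idea, not the Stera SDK's actual format: the `Span` class, its field names, the `labels_at` helper, and the example labels are all hypothetical and exist only to show how multi-scale labels could be looked up for a single timestamp.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """One annotation at one level. Hypothetical layout, not the SDK schema."""
    level: str          # "session", "sub_goal", "episode", or "atomic_action"
    start_s: float      # inclusive start time in seconds
    end_s: float        # exclusive end time in seconds
    text: str           # free-form language label
    children: List["Span"] = field(default_factory=list)


def labels_at(span: Span, t: float) -> List[str]:
    """Return every label active at time t, ordered coarsest to finest."""
    if not (span.start_s <= t < span.end_s):
        return []
    active = [f"{span.level}: {span.text}"]
    for child in span.children:
        active.extend(labels_at(child, t))
    return active


# Made-up example content for illustration only.
session = Span("session", 0.0, 1230.0, "make breakfast", [
    Span("sub_goal", 0.0, 300.0, "prepare coffee", [
        Span("episode", 40.0, 95.0, "fill the kettle", [
            Span("atomic_action", 40.0, 44.5, "grasp kettle handle"),
            Span("atomic_action", 44.5, 60.0, "hold kettle under tap"),
        ]),
    ]),
])

for label in labels_at(session, 42.0):
    print(label)
```

A lookup like this would let a training loop attach the coarse goal and the fine-grained action to the same frame, which is the kind of multi-scale conditioning the hierarchy is meant to support.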
Who It's For & Trade-offs
Stera-10M is a good fit if you need long-horizon, geometry-aware, multimodal signals for embodied-AI and world-model training, imitation learning and long-horizon manipulation, hand-object interaction research, or camera-pose and SLAM benchmarking. The trade-offs are practical: the dataset is large (~1.6 TB) and access is gated, so plan for substantial storage and a controlled-access workflow. RGB is recorded at 1280×720 and 15 fps, which suits many perception tasks but not high-frame-rate or ultra-high-resolution work.
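For storage and compute planning, the headline figures can be turned into rough per-hour estimates. This is a back-of-the-envelope sketch that assumes all 200 hours are recorded at 15 fps, that "~1.6 TB" means decimal terabytes, and that data volume is spread evenly over recording time; none of these assumptions is confirmed by the source, and the `subset_gb` helper is hypothetical.

```python
HOURS = 200        # total recording time, from the overview
SESSIONS = 584     # number of sessions, from the overview
FPS = 15           # RGB frame rate, from the overview
TOTAL_TB = 1.6     # approximate size; decimal TB is an assumption

# Total RGB frames if every hour is captured at the stated frame rate.
total_frames = HOURS * 3600 * FPS

# Mean session length implied by hours and session count.
mean_session_min = HOURS * 60 / SESSIONS

# Average footprint per recorded hour, assuming uniform density.
gb_per_hour = TOTAL_TB * 1000 / HOURS


def subset_gb(hours: float) -> float:
    """Estimate disk space in GB for a subset of the given duration."""
    return hours * gb_per_hour


print(f"frames: {total_frames:,}")
print(f"mean session: {mean_session_min:.1f} min")
print(f"per hour: {gb_per_hour:.1f} GB")
print(f"25 h subset: {subset_gb(25):.0f} GB")
```

Under these assumptions the corpus holds about 10.8 million RGB frames at roughly 8 GB per recorded hour, and the implied mean session length agrees with the stated ~20.5 minutes.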
Where It Fits
Compared with large heterogeneous corpora that prioritize scale without consistent depth or pose, Stera-10M emphasizes geometric consistency and continuous sessions. Compared with short, high-fidelity captures made on research hardware, it emphasizes long-session continuity and accessibility on commodity phones. Use it when you need reproducible, geometry-rich, multimodal egocentric data captured on widely available devices.
