Why This Matters
Large-scale mid-training datasets that combine long-form video captions with explicit spatial-reasoning examples are uncommon. This collection was assembled to fill that gap for the LLaVA-OneVision-2 family. It mixes tens of thousands of short video clips with separate spatial-reasoning shards and mapping metadata, so a model can learn temporal captioning and grounded spatial reference together.
What Sets It Apart
- Mixed video + spatial-reasoning focus: most corpora emphasize either video captions or spatial grounding; this dataset deliberately co-locates both so a single mid-training run can expose a model to temporal captioning, VQA-style queries, and pointing/reference tasks.
- Scale and shard layout for mid-training: the mid_training_video folder includes ~10,809 WebDataset shards of ~60s clips plus caption JSONL splits for 30s/60s/180s/>10min. The spatial/ folder contains 84 WebDataset shards drawn from multiple spatial datasets (RefCOCO, Visual Genome, pointing tasks, etc.). Both layouts support large-batch streaming during multi-epoch mid-training.
- Provenance mapping: CSV files link each dst_path to its source YouTube ID and start/end times, which simplifies provenance checks, selective re-downloads, and reproducible sampling policies.
- Browser-friendly previews: small Parquet viewer configs provide representative samples (caption and spatial previews) so you can inspect schema and a few examples without downloading the full shards.
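As a minimal sketch of how the provenance mapping described above might be consumed, the snippet below groups clip segments by source video using only the Python standard library. The column names (dst_path, youtube_id, start_time, end_time) are assumed from the description and the sample rows are invented for illustration; check the header of the actual CSV files before relying on them.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows. Column names are assumed from the dataset description
# and may not match the real CSV headers exactly.
SAMPLE_CSV = """dst_path,youtube_id,start_time,end_time
clips/000001.mp4,abc123XYZ_0,0.0,60.0
clips/000002.mp4,abc123XYZ_0,60.0,120.0
clips/000003.mp4,def456UVW_1,15.5,75.5
"""


def load_mapping(fileobj):
    """Group (dst_path, start, end) segments by their source video ID."""
    by_video = defaultdict(list)
    for row in csv.DictReader(fileobj):
        segment = (
            row["dst_path"],
            float(row["start_time"]),
            float(row["end_time"]),
        )
        by_video[row["youtube_id"]].append(segment)
    return dict(by_video)


mapping = load_mapping(io.StringIO(SAMPLE_CSV))

# Selective re-download: list every segment cut from one source video.
for dst_path, start, end in mapping["abc123XYZ_0"]:
    print(f"{dst_path}: {start:.1f}s to {end:.1f}s ({end - start:.1f}s long)")
```

Grouping by video ID first means a source video that has become unavailable can be dropped, or re-fetched, in a single step rather than clip by clip.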
Who It's For (and Tradeoffs)
Great fit if you need mid-training material to improve a multimodal LLM's video captioning, VQA, or spatial-reference capabilities at scale, especially when you want streaming WebDataset shards and explicit source mappings. Look elsewhere if you require fully human-annotated, high-precision labels for evaluation (this corpus is designed for mid-training, not as benchmark-quality ground truth), or if you cannot rely on externally hosted video sources (some clips reference YouTube and may become unavailable later). Note: the dataset packaging is licensed under Apache-2.0, but individual sourced videos retain their original availability and rights considerations.
Where It Fits
Use this as a mid-training supplement after an initial multimodal alignment stage and before final fine-tuning on high-quality, human-annotated benchmarks. It sits between web-scale noisy caption corpora (for breadth) and curated VQA/RefCOCO evaluation sets (for precision), providing temporal context and spatial reasoning signal in the same training sweep.
How the Data Is Structured
- mid_training_video/: WebDataset tar shards (~10,809 for ~60s clips) and caption JSONL splits for multiple clip durations.
- spatial/: 84 WebDataset shards combining reference and spatial tasks (RefCOCO, Visual Genome, pointing, 3D-related examples).
- mapping/: CSV files that map dst_path -> youtube_id + [start_time,end_time], enabling reproducible retrieval and selective sampling.
- viewer_* configs: small Parquet preview samples (30s/60s/180s/>10min captions and spatial thumbnails) intended only for schema inspection in the Hugging Face viewer.
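WebDataset shards are ordinary tar archives in which files sharing the same name up to the first dot of the basename form one sample. The sketch below illustrates that convention with the standard library alone, using a tiny in-memory tar in place of a real shard. The member names, the mp4/json extensions, and the caption field are assumptions for illustration and may not match the actual shard contents.

```python
import io
import json
import tarfile
from collections import defaultdict


def build_demo_shard():
    """Create a tiny in-memory tar laid out like a WebDataset shard."""
    # Hypothetical members: extensions and JSON fields are assumptions.
    members = {
        "000001.mp4": b"fake-video-bytes",
        "000001.json": json.dumps({"caption": "a person opens a door"}).encode(),
        "000002.mp4": b"more-fake-bytes",
        "000002.json": json.dumps({"caption": "a car turns left"}).encode(),
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def iter_samples(fileobj):
    """Group tar members into samples keyed by path up to the first dot.

    Everything is collected in memory, which is fine for a demo shard
    but not for real multi-gigabyte shards.
    """
    samples = defaultdict(dict)
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            dirname, _, base = member.name.rpartition("/")
            stem, _, ext = base.partition(".")
            key = f"{dirname}/{stem}" if dirname else stem
            samples[key][ext] = tar.extractfile(member).read()
    return dict(samples)


samples = iter_samples(build_demo_shard())
for key, parts in samples.items():
    caption = json.loads(parts["json"])["caption"]
    print(key, sorted(parts), caption)
```

For actual training runs, a streaming loader such as the webdataset library is the more practical choice; this sketch is only meant to make the sample layout concrete when inspecting a downloaded shard by hand.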
Overall, the dataset is optimized for streaming mid-training at scale and for experiments that require both temporal/video captioning and grounded spatial reasoning signals in a single corpus.
