LogoAIAny
Icon for item

SenseNova-SI-8M

Curated multimodal training corpus for spatial intelligence: ~8.16M QA-style samples paired with ~2.72M unique images (≈1.1 TB). Provides JSONL annotations, a 1,000-sample preview, and 52 independent image archives — used to train SenseNova-SI models.

Introduction

Most multimodal models still struggle with spatial reasoning that humans find trivial. This dataset was assembled to explicitly teach spatial intelligence at scale by pairing natural-language, dialogue-style QA with diverse real-world images under a rigorous spatial taxonomy — letting researchers isolate spatial capabilities from generic vision-and-language noise.

What Sets It Apart
  • Large, spatially-focused scale: ~8.16 million annotated samples and ~2.72 million unique images concentrated on spatial tasks, not generic captioning — so what: enables controlled scaling analyses and trains models to generalize spatial relations rather than exploit linguistic shortcuts.
  • Dialogue-style, multimodal entries: annotations are stored as conversational turns with inline <image> placeholders (JSONL), matching how many deployed assistants receive multimodal queries — so what: facilitates training and evaluation of models that must handle interleaved visual context and dialog.
  • Practical distribution for research: image data is packaged as 52 independent ~21 GB zips and a 1,000-sample parquet preview — so what: simplifies selective download, reproducible experiments, and large-scale training without custom archive tooling.
  • Benchmark integration and released models: tied to a CVPR 2026 paper and used to train the recommended SenseNova-SI-1.1-InternVL3-8B model — so what: provides both a dataset and a reference model/metrics to compare new methods on spatial intelligence.
Who It's For and Trade-offs

Great fit if you are training or evaluating multimodal models specifically for spatial reasoning and vision-question-answering, want a large-scale dataset for scaling-law studies, or need a production-scale image corpus with conversational annotations. Look elsewhere if you need densely annotated per-pixel labels (this dataset focuses on image-level QA/dialog annotations), very small-footprint datasets, or non-English primary data (this release is English-centric). The 1.1 TB image archive also means significant storage and bandwidth requirements for full-scale experiments.

Where It Fits

Use this dataset when the research question centers on spatial relation understanding, spatial chain-of-thought, or improving VQA models' grounded reasoning. For tasks requiring instance segmentation, 3D geometry ground-truth, or multilingual coverage, complement SenseNova-SI-8M with task-specific datasets instead.

Information

Categories