Most multimodal SFT datasets focus on natural images or synthetic captions; geospatial supervision is still sparse yet critical for earth-observation tasks. This snapshot packages satellite RGB chips with aligned per-pixel land-cover rasters and human-readable per-tile captions so visual-language models can learn grounded land-cover descriptions and bounding-box grounding from real satellite sources.
What Sets It Apart
- Tile-level SFT rows: each JSONL row aligns an RGB chip with one or more text messages (global caption, per-class captions, optional bounding-box overlays) ready for SFT-style supervised fine-tuning. This reduces pre-processing friction for training multimodal models.
- Geospatial provenance and alignment: optical tiles are sourced from Sentinel-2 previews and labels come from Earth Engine rasters reprojected to the same 10 m grid, then tiled and downsampled with masks—helpful when you need spatially consistent label alignment rather than heuristic bbox-only labels.
- Optional context and visualization: the snapshot includes optional Mapbox stills per POI and overlay PNGs for bbox visualization, which eases qualitative evaluation and can be used for instruction-style context rows.
- Hub-friendly layout: images, overlays, map stills, metadata, and sharded JSONL splits are pre-organized and constrained to Hub file limits, simplifying large-dataset downloads and streaming.
Who It's For — Tradeoffs
Great fit if you want a ready-made dataset to fine-tune multimodal/vision-language models on land-cover description, grounding, or bounding-box supervision using real Sentinel‑2 imagery. It suits research and prototyping for remote-sensing captioning, weakly-supervised grounding, or instruction-tuned geospatial models.
Look elsewhere if you need fully labeled instance segmentation or high-resolution commercial imagery—labels are per-pixel land-cover rasters reprojected and nearest-neighbor downsampled (10 m grid), and some assets (Mapbox stills, STAC sources) carry external ToS that you must respect.
Where It Fits
This snapshot sits between classic remote-sensing raster datasets (dense land-cover labels) and general vision-language SFT corpora: it supplies georeferenced image–text pairs that are specifically formatted for SFT workflows, not raw research-only rasters or purely vector GIS products.
How It Was Produced
Tiles derive from Sentinel‑2 visual previews (blue/green/red or visual), POIs come from a GeoGuessr-style pano dataset for contextual locations, and label rasters originate from Earth Engine exports. Rasters were reprojected to the 10 m RGB grid, tiled, and nearest-neighbor downsampled; JSONL rows include metadata sidecars (coordinates, scene ID, per-class counts) and splits hashed by poi_id for train/validation/test separation.
Use this dataset with attention to the original STAC/Earth Engine/Mapbox terms and the HF pano dataset license for coordinate provenance.
