Small, schema-faithful samples are useful for building reproducible baselines and validating recommender data pipelines. This demo mirrors TAAC2026's production feature layout in a compact Parquet file (1,000 rows, ~39 MB, 120 top-level columns), letting you test feature alignment, sequence handling, and I/O behavior before scaling to the full dataset.
What Sets It Apart
- Flat column layout: all features are top-level columns (no nested structs), which simplifies vectorized reads and the common preprocessing logic used in production recommender pipelines. Because the demo shares this layout with the full data, code ported from one to the other is less likely to hit schema-mismatch bugs.
- Realistic feature mix: includes ID/label fields, 46 user integer features (scalars and arrays), 10 user dense array features, 14 item integer features, and 45 domain sequence features across four behavioral domains. This mix is useful for exercising end-to-end feature parsing and sequence truncation strategies.
- Compact but representative: at ~39 MB and 1,000 rows, it is small enough for CI and local experiments, yet it preserves the column diversity and sequence shapes found in the larger TAAC2026 data.
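As an illustration of the sequence truncation mentioned above, here is a minimal sketch in plain Python. It assumes each sequence cell arrives as a list of integer IDs ordered oldest to newest and that 0 is free to use as a padding value; neither assumption is confirmed by the dataset, and the function names are hypothetical.

```python
from typing import List, Optional, Sequence


def truncate_and_pad(
    seq: Optional[Sequence[int]], max_len: int, pad_value: int = 0
) -> List[int]:
    """Keep the last `max_len` items, then right-pad to exactly `max_len`.

    Assumes oldest-first ordering, so the tail holds the most recent events.
    A missing (None) sequence is treated as empty.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    items = [] if seq is None else list(seq)
    kept = items[-max_len:]
    return kept + [pad_value] * (max_len - len(kept))


def batch_sequences(
    rows: Sequence[Optional[Sequence[int]]], max_len: int, pad_value: int = 0
) -> List[List[int]]:
    """Apply the same truncation to every row so the batch is rectangular."""
    return [truncate_and_pad(row, max_len, pad_value) for row in rows]


batch = batch_sequences([[1, 2, 3, 4, 5], [7], None], max_len=3)
print(batch)
```

Keeping the tail rather than the head is a design choice that favors recent behavior; if the real sequences are stored newest-first, slice from the front instead.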
Who It's For & Tradeoffs
Great fit if you need a lightweight, schema-accurate dataset to validate data ingestion, offline feature engineering, or model input pipelines for recommendation tasks. It’s ideal for unit tests, debugging feature alignment, and trying sequence batching/truncation logic.
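A unit test for feature alignment can be as simple as diffing the schema your pipeline expects against the one it actually reads. The sketch below works on plain `name -> type string` dictionaries; the column names and type strings are hypothetical placeholders, not the dataset's real ones. In practice the `actual` mapping could be built from whatever Parquet reader you use.

```python
from typing import Dict, List


def diff_schema(
    expected: Dict[str, str], actual: Dict[str, str]
) -> Dict[str, List[str]]:
    """Report columns that are missing, unexpected, or typed differently."""
    expected_names = set(expected)
    actual_names = set(actual)
    return {
        "missing": sorted(expected_names - actual_names),
        "unexpected": sorted(actual_names - expected_names),
        "type_mismatch": sorted(
            name
            for name in expected_names & actual_names
            if expected[name] != actual[name]
        ),
    }


# Hypothetical column names and types, for illustration only.
expected_schema = {
    "user_id": "int64",
    "item_id": "int64",
    "label": "int32",
    "user_feat_a": "list<int64>",
}
actual_schema = {
    "user_id": "int64",
    "item_id": "string",         # type drifted
    "user_feat_a": "list<int64>",
    "extra_col": "float",        # not expected
}

report = diff_schema(expected_schema, actual_schema)
print(report)
```

An empty list under every key means the two schemas agree on names and types, which makes the result easy to assert on in CI.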
Look elsewhere if you need large-scale training data, long-tail item coverage, or statistically representative user populations. This is a demo sample (CC BY-NC 4.0) intended for development and evaluation, not for production training at scale.
Where It Fits
Use this dataset as a staging artifact: confirm preprocessing code, verify schema/column naming, and benchmark I/O and memory patterns before switching to the full TAAC2026 release or tournament feed.
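To get a rough I/O and memory baseline, one option is to wrap the loading step in a small profiler built on the standard library. The sketch below is a generic harness, not part of the dataset tooling: `profile_load` is a hypothetical helper, and the lambda stands in for your real Parquet loading call. Note that `tracemalloc` only sees allocations made through Python's allocators, so buffers allocated natively by a columnar reader may not be counted.

```python
import time
import tracemalloc
from typing import Any, Callable, Dict


def profile_load(load_fn: Callable[[], Any]) -> Dict[str, float]:
    """Time a zero-argument loader and record peak Python-level memory."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        load_fn()
    finally:
        # Always collect measurements and stop tracing, even on failure.
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return {
        "seconds": elapsed,
        "peak_python_mib": peak_bytes / (1024 * 1024),
    }


# Stand-in workload; replace the lambda with your actual loading call.
stats = profile_load(lambda: [0] * 100_000)
print(stats)
```

Running the same harness on the demo file and later on the full data gives comparable numbers, though timings on a 1,000-row sample will not extrapolate linearly.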
