Why this matters
Genomic foundation models need both scale and domain diversity: this corpus combines large, curated eukaryote and prokaryote genomic chunks with processed mRNA transcripts to deliver biological signal at scale (≈1.1 trillion nucleotides, or roughly 180B tokens with a 6-mer tokenizer). That mixture makes it practical to train or evaluate DNA/RNA language models that learn species-, gene-, and splice-aware patterns without having to assemble disparate raw sources first.
What sets it apart
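The token figure is consistent with the nucleotide count if 6-mers are taken without overlap. The sketch below illustrates that arithmetic; the non-overlapping stride, the handling of the trailing remainder, and the function names `kmer_tokenize` and `estimate_tokens` are assumptions for illustration, not the corpus's actual tokenizer.

```python
def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Split a sequence into non-overlapping k-mers (assumed stride = k).

    A trailing remainder shorter than k is kept as its own piece; a real
    tokenizer may pad or handle it differently.
    """
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def estimate_tokens(n_nucleotides: float, k: int = 6) -> float:
    """Rough token count for a corpus, ignoring special and metadata tokens."""
    return n_nucleotides / k


# 15 nucleotides become two full 6-mers plus a 3-nucleotide remainder.
pieces = kmer_tokenize("ATGGCGTACGTTAGC")

# About 1.1 trillion nucleotides gives roughly 1.83e11 (about 180B) tokens.
approx_tokens = estimate_tokens(1.1e12)
```

Under these assumptions, sequence length in tokens is about one sixth of its length in nucleotides, which is worth keeping in mind when choosing a context window.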
- Multi-domain, large-scale mixture — eukaryote, prokaryote, and two mRNA variants (including promoter/splice-augmented transcripts). So what: models see both genome context and processed transcript structure, improving generalization to tasks like gene prediction or splice-site recognition.
- Practical production formats — parquet files, streaming-friendly dataset config, and a pre-sampled eukaryote_10B subset. So what: teams can prototype on the 10B-token subset, then scale to the full 1.1T-nucleotide mixture without reassembly overhead.
- Metadata conditioning for eukaryotes — optional species and gene_type tags are prepended with dropout during tokenization. So what: enables conditional generation or targeted fine-tuning (e.g., species-specific sequence synthesis) without forcing models to rely on metadata.
- Upstream license mix documented per subset (MIT for GENERator-derived eukaryote parts; Apache‑2.0 for Evo2 sources). So what: reuse and redistribution depend on the subset — important for downstream model release and data audits.
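The metadata conditioning described above can be sketched as follows. This is a minimal illustration, assuming each tag is dropped independently with a fixed probability; the tag format (`<species:...>`), the `drop_prob` parameter, and the function name `add_metadata_prefix` are hypothetical and not taken from the dataset's tokenization code.

```python
import random


def add_metadata_prefix(sequence, species=None, gene_type=None,
                        drop_prob=0.5, rng=None):
    """Prepend optional metadata tags, dropping each one with probability drop_prob.

    Dropping tags at random means the model also sees untagged sequences,
    so it cannot rely on metadata always being present.
    """
    rng = rng or random
    tags = []
    for name, value in (("species", species), ("gene_type", gene_type)):
        # Keep the tag only if it exists and survives the dropout draw.
        if value is not None and rng.random() >= drop_prob:
            tags.append(f"<{name}:{value}>")  # hypothetical tag syntax
    return "".join(tags) + sequence


# With drop_prob=0.0 every available tag is kept; with 1.0 none are.
tagged = add_metadata_prefix("ATGGCG", species="homo_sapiens",
                             gene_type="protein_coding", drop_prob=0.0)
untagged = add_metadata_prefix("ATGGCG", species="homo_sapiens",
                               gene_type="protein_coding", drop_prob=1.0)
```

At inference time the same prefix can be supplied to steer generation toward a species or gene type, or omitted for unconditional use.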
Who it's for & tradeoffs
Great fit if you are building or benchmarking genomic LMs, training foundation models like Carbon, or researching sequence-level pretraining strategies and tokenizer designs. The dataset lowers engineering burden by aggregating upstream sources and providing matching configs for small- and large-scale experiments.
Look elsewhere if you need clinical human‑subject sequencing with controlled-consent metadata, very long contiguous chromosomes (many eukaryote sequences are filtered ≤100 kbp), or fully uniform licensing across all shards. Also note genomic data can have privacy and ethical considerations — confirm provenance and license for any sensitive samples before model release.
Where it fits
Use this corpus as a pretraining backbone for biology‑aware transformer models, to generate benchmark training mixtures, or as a reproducible data split when comparing tokenizers and context-window strategies in DNA/RNA modeling.
