
Raon-OpenTTS-Pool

Provides a unified 615k-hour English speech corpus for TTS training, aggregating 11 public datasets and web-sourced recordings into 16 kHz Opus WebDataset shards. It includes a quality-filtered core subset (510.1k hours) and metadata splits; licenses vary by source.

Introduction

Large gains in TTS quality often come from the breadth and scale of training data rather than from small architecture tweaks. This dataset addresses that need by assembling one of the largest openly distributed English speech pools for TTS research and model training. Audio is standardized to 16 kHz mono Opus, and the release exposes metadata-only splits alongside WebDataset tar shards, which enables scalable streaming training and reproducible experiments across diverse public sources.

What Sets It Apart
  • Scale and diversity: 615k hours and ~239.7M segments aggregated from 11 public sources plus web-sourced recordings (Raon-YouTube-Commons), covering audiobooks, broadcast, crowdsourced, and cleaned speech subsets. That scale is intended to support large TTS models that require broad speaker and acoustic variation.
  • Practical storage and streaming format: audio is stored as 16 kHz mono Opus (64 kbps) inside WebDataset tar shards, which reduces storage and eases distributed streaming for large-scale training pipelines. Metadata is provided as parquet files with two splits: pool (all samples) and core (quality-filtered subset).
  • Quality-controlled core subset: a model-driven filtering pipeline scores each sample on ASR word error rate (WER), DNSMOS perceptual quality, and VAD-based speech-activity ratio (SAR), then removes the lowest 15% by combined rank to produce Raon-OpenTTS-Core (~510.1k hours). The filtering thresholds and metrics are reported in the paper and repo, so the filtering can be reproduced.
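
The combined-rank filtering described in the last point can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the source states only that the lowest 15% by combined rank are removed, so the equal-weight average of per-metric ranks is an assumption. The function names, the sample fields (`key`, `wer`, `dnsmos`, `sar`), the toy data, and the choice to treat a higher SAR as better are likewise assumptions.

```python
def quality_ranks(values, higher_is_better):
    """Map each value to a rank in [0, 1], where 1.0 is the best sample."""
    n = len(values)
    # Sort indices from worst to best for this metric.
    order = sorted(range(n), key=lambda i: values[i], reverse=not higher_is_better)
    ranks = [0.0] * n
    for position, index in enumerate(order):
        ranks[index] = position / max(n - 1, 1)
    return ranks


def select_core(samples, drop_percent=15):
    """Drop the worst `drop_percent` of samples by averaged metric rank."""
    wer = quality_ranks([s["wer"] for s in samples], higher_is_better=False)
    mos = quality_ranks([s["dnsmos"] for s in samples], higher_is_better=True)
    # Assumption: a higher speech-activity ratio counts as better.
    sar = quality_ranks([s["sar"] for s in samples], higher_is_better=True)
    # Assumption: equal weights; the source does not say how ranks are combined.
    combined = [(w + m + a) / 3 for w, m, a in zip(wer, mos, sar)]
    n_drop = len(samples) * drop_percent // 100
    worst_first = sorted(range(len(samples)), key=lambda i: combined[i])
    dropped = set(worst_first[:n_drop])
    return [s for i, s in enumerate(samples) if i not in dropped]


# Toy data: 20 hypothetical samples whose quality improves with the index.
samples = [
    {"key": f"s{i}", "wer": (19 - i) / 100, "dnsmos": 1.0 + 0.1 * i, "sar": 0.5 + 0.02 * i}
    for i in range(20)
]
core = select_core(samples)
core_keys = {s["key"] for s in core}
```

With 20 toy samples, 15% corresponds to three dropped samples, so `core` keeps 17 entries. Ranking each metric before combining avoids having to put WER, DNSMOS, and SAR on a common scale.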

Who It's For and Trade-offs

Great fit if you need large, diverse English TTS training data for open research, pretraining, or robustness experiments, and if you can handle large-scale storage/IO and mixed licensing. It’s also suitable when you want streaming-friendly WebDataset shards rather than monolithic WAV collections.

Look elsewhere if you need licensing that permits commercial use across the whole corpus (some sub-datasets are non-commercial or require separate agreements), need audio above 16 kHz to preserve fine high-frequency speech detail, or require per-sample human-verified transcripts, since many transcripts originate from automatic pipelines and web sources. Two contributor datasets (GigaSpeech and SPGISpeech) are non-redistributable and require separate acceptance and conversion steps.
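
To gauge the storage side of that trade-off, the stated figures (615k hours, 16 kHz mono, 64 kbps Opus) support a back-of-the-envelope estimate. The sketch below is only a rough payload calculation: it treats 64 kbps as a constant bitrate, ignores container and tar overhead, and uses 16-bit PCM as a hypothetical uncompressed baseline for comparison. The helper name `payload_bytes` is illustrative.

```python
SECONDS_PER_HOUR = 3600


def payload_bytes(hours, bits_per_second):
    """Audio payload size in bytes for a given duration and bitrate."""
    return hours * SECONDS_PER_HOUR * bits_per_second // 8


POOL_HOURS = 615_000                 # pool size stated in the listing
OPUS_BPS = 64_000                    # 64 kbps Opus, treated as constant bitrate
PCM_BPS = 16_000 * 16                # 16 kHz mono, 16-bit PCM (assumed baseline)

opus_bytes = payload_bytes(POOL_HOURS, OPUS_BPS)
pcm_bytes = payload_bytes(POOL_HOURS, PCM_BPS)

print(f"Opus payload: {opus_bytes / 1e12:.1f} TB")
print(f"PCM payload:  {pcm_bytes / 1e12:.1f} TB")
print(f"Ratio:        {pcm_bytes / opus_bytes:.1f}x")
```

Under these assumptions the Opus payload comes to roughly 17.7 TB, about a quarter of the roughly 70.8 TB that the same audio would occupy as 16-bit PCM. Actual on-disk size will differ with variable bitrate, metadata, and container overhead.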

Notes on use

Metadata splits let you filter for core-only samples at training time; audio tars mix core/pool samples so you must select by sample_key to restrict training. The Raon-YouTube-Commons component was rebuilt through a pipeline (source separation, diarization, VAD, Whisper-large-v3 transcription) to improve alignment over raw YouTube captions. Because licenses vary by sub-dataset, users must comply with each sub-dataset’s terms and may exclude non-commercial parts if needed.
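
Restricting training to core samples by `sample_key` can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: it assumes that each `sample_key` in the core metadata equals the WebDataset key of the tar member (the path up to the first dot of the file name) and that audio members use an `.opus` extension. The function names and the in-memory demo shard are hypothetical; in practice the key set would come from the core parquet split and the shard from the released tar files.

```python
import io
import tarfile


def webdataset_key(member_name):
    """Return the WebDataset-style key: the path up to the first dot of the basename."""
    directory, slash, basename = member_name.rpartition("/")
    return directory + slash + basename.split(".", 1)[0]


def iter_core_members(shard, core_keys):
    """Yield (name, bytes) for tar members whose key is in `core_keys`."""
    with tarfile.open(fileobj=shard, mode="r:") as tar:
        for member in tar:
            if member.isfile() and webdataset_key(member.name) in core_keys:
                yield member.name, tar.extractfile(member).read()


def build_demo_shard(names):
    """Build a tiny in-memory tar standing in for a real shard (demo only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name in names:
            payload = name.encode()  # placeholder bytes, not real audio
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


# Hypothetical keys; a real set would be read from the core metadata split.
core_keys = {"utt_a", "utt_c"}
shard = build_demo_shard(["utt_a.opus", "utt_b.opus", "utt_c.opus"])
selected = [name for name, _ in iter_core_members(shard, core_keys)]
```

Here `selected` contains only the two members whose keys appear in `core_keys`. Holding the keys in a set keeps each membership check cheap, which matters when a shard is streamed once and most of the work is I/O.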
