Long-form, multi-accent conversational speech exposes weaknesses that short-read benchmarks do not: segmentation sensitivity, sustained disfluencies, role-based turn-taking, and accent-driven error modes. AppTek Call-Center Dialogues was released to provide a controlled, evaluation-only corpus that stresses these real-world factors so researchers and engineers can measure where ASR systems fail in extended, service-oriented interactions.
What Sets It Apart
- Realistic long-form sessions (5–15 min, avg ~10.4 min, 128.6 hours total). So what: aggregated-session scoring reveals error accumulation and context-related mistakes that utterance-level datasets mask.
- Broad accent coverage (14 English accent groups, ~8–11 hours per accent). So what: enables per-accent breakdowns and accent-robustness analyses rather than coarse ‘native/non-native’ splits.
- Conversation structure with split-channel audio (one speaker per file) and role-played agent–customer dialogs across 16 service domains. So what: supports segmentation and diarization experiments plus domain-specific vocabulary evaluation while avoiding real PII.
- Evaluation-first release with explicit scoring recommendations (Whisper normalization, jiwer WER, Silero VAD segmentation settings). So what: provides a reproducible pipeline and clarifies how segmentation and normalization choices affect WER.
Who It's For and Tradeoffs
Great fit if you need to: evaluate ASR robustness on long, conversational inputs; compare segmentation strategies; analyze per-accent performance or domain-specific failure modes. The dataset’s curated, role-played design and controlled collection make cross-accent comparisons tractable.
Look elsewhere if you need: training data for model fitting (the dataset is intended for evaluation/analysis, not for training), large-scale speaker demographic diversity beyond the present splits, or real customer PII—these are intentionally not included.
Recommended Evaluation Practices
Follow the dataset’s recommended pipeline: apply reproducible text normalization (Whisper EnglishTextNormalizer as provided), segment audio with a consistent VAD (Silero VAD recommended: min silence 10s, min speech 0.25s, max speech 30s), and compute session-aggregated WER with jiwer. Always report segmentation and normalization details alongside WER to ensure comparability.
Quick facts
- Total duration: 128.6 hours (test split)
- Speakers: 156 across 14 accent groups
- Conversations: 873 (split-channel audio → 1,746 files)
- License: CC BY‑SA 4.0
By focusing on long-form conversational dynamics and systematic accent coverage, this benchmark helps surface real-world ASR weaknesses that short-read or scripted corpora tend to hide.
