LogoAIAny
Icon for item

AppTek Call-Center Dialogues

Benchmarks ASR on long-form English call-center conversations with wide accent coverage; 128.6 hours across 14 accent groups and 16 service domains, designed for segmentation-sensitive evaluation and intended for evaluation/analysis (CC BY‑SA 4.0).

Introduction

Long-form, multi-accent conversational speech exposes weaknesses that short-read benchmarks do not: segmentation sensitivity, sustained disfluencies, role-based turn-taking, and accent-driven error modes. AppTek Call-Center Dialogues was released to provide a controlled, evaluation-only corpus that stresses these real-world factors so researchers and engineers can measure where ASR systems fail in extended, service-oriented interactions.

What Sets It Apart
  • Realistic long-form sessions (5–15 min, avg ~10.4 min, 128.6 hours total). So what: aggregated-session scoring reveals error accumulation and context-related mistakes that utterance-level datasets mask.
  • Broad accent coverage (14 English accent groups, ~8–11 hours per accent). So what: enables per-accent breakdowns and accent-robustness analyses rather than coarse ‘native/non-native’ splits.
  • Conversation structure with split-channel audio (one speaker per file) and role-played agent–customer dialogs across 16 service domains. So what: supports segmentation and diarization experiments plus domain-specific vocabulary evaluation while avoiding real PII.
  • Evaluation-first release with explicit scoring recommendations (Whisper normalization, jiwer WER, Silero VAD segmentation settings). So what: provides a reproducible pipeline and clarifies how segmentation and normalization choices affect WER.
Who It's For and Tradeoffs

Great fit if you need to: evaluate ASR robustness on long, conversational inputs; compare segmentation strategies; analyze per-accent performance or domain-specific failure modes. The dataset’s curated, role-played design and controlled collection make cross-accent comparisons tractable.

Look elsewhere if you need: training data for model fitting (the dataset is intended for evaluation/analysis, not for training), large-scale speaker demographic diversity beyond the present splits, or real customer PII—these are intentionally not included.

Follow the dataset’s recommended pipeline: apply reproducible text normalization (Whisper EnglishTextNormalizer as provided), segment audio with a consistent VAD (Silero VAD recommended: min silence 10s, min speech 0.25s, max speech 30s), and compute session-aggregated WER with jiwer. Always report segmentation and normalization details alongside WER to ensure comparability.

Quick facts
  • Total duration: 128.6 hours (test split)
  • Speakers: 156 across 14 accent groups
  • Conversations: 873 (split-channel audio → 1,746 files)
  • License: CC BY‑SA 4.0

By focusing on long-form conversational dynamics and systematic accent coverage, this benchmark helps surface real-world ASR weaknesses that short-read or scripted corpora tend to hide.

Information

  • Websitehuggingface.co
  • AuthorsEugen Beck, Sarah Beranek, Uma Moothiringote, Daniel Mann, Wilfried Michel, Katie Nguyen, Taylor Tragemann, AppTek.ai
  • Published date2026/04/23

Categories