AIAny - AppTek Call-Center Dialogues

Long-form, multi-accent conversational speech exposes weaknesses that short-read benchmarks do not: segmentation sensitivity, sustained disfluencies, role-based turn-taking, and accent-driven error modes. AppTek Call-Center Dialogues was released to provide a controlled, evaluation-only corpus that stresses these real-world factors so researchers and engineers can measure where ASR systems fail in extended, service-oriented interactions.

What Sets It Apart

Realistic long-form sessions (5–15 min, avg ~10.4 min, 128.6 hours total). So what: aggregated-session scoring reveals error accumulation and context-related mistakes that utterance-level datasets mask.
Broad accent coverage (14 English accent groups, ~8–11 hours per accent). So what: enables per-accent breakdowns and accent-robustness analyses rather than coarse ‘native/non-native’ splits.
Conversation structure with split-channel audio (one speaker per file) and role-played agent–customer dialogs across 16 service domains. So what: supports segmentation and diarization experiments plus domain-specific vocabulary evaluation while avoiding real PII.
Evaluation-first release with explicit scoring recommendations (Whisper normalization, jiwer WER, Silero VAD segmentation settings). So what: provides a reproducible pipeline and clarifies how segmentation and normalization choices affect WER.

Who It's For and Tradeoffs

Great fit if you need to: evaluate ASR robustness on long, conversational inputs; compare segmentation strategies; analyze per-accent performance or domain-specific failure modes. The dataset’s curated, role-played design and controlled collection make cross-accent comparisons tractable.

Look elsewhere if you need: training data for model fitting (the dataset is intended for evaluation/analysis, not for training), large-scale speaker demographic diversity beyond the present splits, or real customer PII—these are intentionally not included.

Recommended Evaluation Practices

Follow the dataset’s recommended pipeline: apply reproducible text normalization (Whisper EnglishTextNormalizer as provided), segment audio with a consistent VAD (Silero VAD recommended: min silence 10s, min speech 0.25s, max speech 30s), and compute session-aggregated WER with jiwer. Always report segmentation and normalization details alongside WER to ensure comparability.

Quick facts

Total duration: 128.6 hours (test split)
Speakers: 156 across 14 accent groups
Conversations: 873 (split-channel audio → 1,746 files)
License: CC BY‑SA 4.0

By focusing on long-form conversational dynamics and systematic accent coverage, this benchmark helps surface real-world ASR weaknesses that short-read or scripted corpora tend to hide.

AppTek Call-Center Dialogues

Introduction

What Sets It Apart

Who It's For and Tradeoffs

Recommended Evaluation Practices

Quick facts

Information

Categories

Tags

More Items

Nemotron-Personas-El-Salvador

PhysicalAI-WorldModel-Synthetic-Physical-Interaction-Scenes

Claude Mythos Distilled 25K