Long-form speech is where many current models fail: short-utterance benchmarks miss the temporal reasoning, contextual summarization, and speaker-level analysis that real conversations require. Marco-LongSpeech addresses that gap by pairing large-scale long recordings with diverse, multi-task annotations, so models can be evaluated on sustained comprehension rather than isolated clips.
What Sets It Apart
- Multi-task, single-source layout: Each example is packaged with task-specific JSONL files (ASR, Temporal_Relative_QA, summary, content_separation, emotionQA, speaker_count, translation, language_detection), enabling consistent cross-task evaluation and multi-task training without manual alignment overhead. This means you can run the same audio through transcription, summarization, and temporal QA pipelines and compare failure modes directly.
- Long-audio focus with scale: The dataset contains 101,822 unique long WAVs split across three partitions (LongSpeech_p1/p2/p3) and roughly 204,881 total examples across train/val/test. Where short-utterance corpora (e.g., CommonVoice, LibriSpeech) center on brief clips, Marco-LongSpeech emphasizes duration and temporal continuity, which are essential for evaluating memory, temporal localization, and coherence.
- Task variety that targets comprehension, not just recognition: Beyond ASR and translation, the included tasks test temporal reasoning (Temporal_Relative_QA), summarization of lengthy content, content separation (detecting unrelated concatenations), emotion QA, and speaker counting. Together they give a broader picture of a model’s long-context understanding.
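To make the cross-task comparison described above concrete, here is a minimal sketch of grouping already-parsed task records by their source audio. It assumes each record carries an `audio` field naming its recording; that field name, and the helpers `group_by_audio` and `shared_across`, are illustrative assumptions and not part of the dataset's documented schema, so check the actual JSONL keys before relying on it.

```python
from collections import defaultdict


def group_by_audio(task_records, key="audio"):
    """Map each audio identifier to {task_name: [records]}.

    task_records: dict of task name -> list of dicts parsed from that task's JSONL.
    key: ASSUMED field name identifying the source recording (hypothetical).
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for task, records in task_records.items():
        for rec in records:
            audio_id = rec.get(key)
            if audio_id is None:
                continue  # skip records that lack the assumed key
            grouped[audio_id][task].append(rec)
    # Convert nested defaultdicts to plain dicts for predictable lookups.
    return {audio_id: dict(per_task) for audio_id, per_task in grouped.items()}


def shared_across(grouped, tasks):
    """Return sorted audio ids that have at least one record for every listed task."""
    return sorted(
        audio_id
        for audio_id, per_task in grouped.items()
        if all(task in per_task for task in tasks)
    )
```

Calling `shared_across(grouped, ["ASR", "summary", "Temporal_Relative_QA"])` would then yield the recordings on which all three tasks can be scored side by side, which is one way to line up failure modes per recording.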
Who It's For and Trade-offs
Great fit if you are: training or evaluating models intended to handle long audio (LLM-based audio understanding, long-context ASR, long-form summarization, temporal QA); building multi-task audio benchmarks; or studying failure modes across tasks on the same audio. The dataset’s Apache-2.0 licensing and JSONL organization make it straightforward to load and integrate.

Look elsewhere if you need: high-quality, speaker-verified diarization labels; broad multilingual coverage beyond English/Chinese; or carefully curated studio-quality recordings. Marco-LongSpeech prioritizes scale and long-duration realism over per-sample studio curation. Also verify downstream use constraints and provenance for specific research or commercial uses.
Practical Notes
- Each partition ships with all_audios.jsonl and metadata.json files, and the example format is geared toward loading with Hugging Face Datasets or standard JSON loaders.
- The paper (arXiv:2601.13539) and a companion GitHub repo (AIDC-AI/Marco-Longspeech) cover methods and generation details; consult them for annotation methodology and the exact license text before redistribution.
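As a rough illustration of the "standard JSON loaders" route, the sketch below reads each partition's all_audios.jsonl with the Python standard library and counts its entries. The directory layout (`<root>/LongSpeech_p1/all_audios.jsonl` and so on) and the helper names are assumptions made for this example; the actual layout may differ, so adjust the paths after inspecting a local copy.

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


def count_partition_entries(
    root,
    partitions=("LongSpeech_p1", "LongSpeech_p2", "LongSpeech_p3"),
    index_name="all_audios.jsonl",
):
    """Count index entries per partition.

    ASSUMES a layout of <root>/<partition>/<index_name>; partitions whose
    index file is missing are left out of the result.
    """
    counts = {}
    for part in partitions:
        index_path = Path(root) / part / index_name
        if index_path.exists():
            counts[part] = sum(1 for _ in iter_jsonl(index_path))
    return counts
```

Summing the returned counts and comparing the total against the 101,822 WAVs reported above is a cheap sanity check that a download is complete, assuming each index line corresponds to one recording.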
