Long-form, multi-speaker dialogue is where many state-of-the-art single-turn zero-shot TTS models break: stitching independently synthesized turns often loses acoustic consistency, conversational flow, and affective continuity. SwanVoice targets this gap by combining a purpose-built in-the-wild speech corpus (SwanData-Speech), a training schedule that progresses from monologue to mixed to real dialogue data, and a modeling stack designed for turn-aware consistency and expressiveness. The reported result is improved hierarchical richness in both monologue and dialogue settings. (arxiv.org)
Key findings
- Training strategy matters: starting from monologue data and progressively exposing the model to mixed and real dialogue preserves single-speaker quality while teaching the model turn-level conversational dynamics. This reduces the typical artifacts introduced when separately synthesizing and stitching turns. (arxiv.org)
- Architecture and conditioning choices enable consistency: a 25 Hz VAE plus raw-text conditioning with pause-aware symbols and pinyin substitution gives the model finer control over prosody and pause structure, which is crucial for long-form coherence. The flow-matching DiT backbone with explicit speaker-turn conditioning helps maintain speaker identity across turns. (arxiv.org)
- Post-training objectives improve perceived expressiveness: DiffusionNFT fine-tuning with phone-level and speaker-similarity rewards increases hierarchical richness and speaker similarity in evaluation, though the authors note content accuracy as the main remaining limitation. (arxiv.org)
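To make the staged schedule and the 25 Hz latent rate concrete, here is a minimal Python sketch. It is not the authors' implementation. The stage boundaries, the dialogue mixing ratios, the helper names (`stage_for_step`, `sample_source`, `latent_frames`), and the choice to round frame counts up are all illustrative assumptions. Only the monologue, mixed, dialogue ordering and the 25 Hz rate come from the summary above.

```python
import math
import random

# Hypothetical stage start steps and dialogue ratios.
# These numbers are illustrative assumptions, not values from the paper.
STAGES = [
    ("monologue", 0, 0.0),       # only single-speaker utterances
    ("mixed", 100_000, 0.5),     # half monologue, half dialogue
    ("dialogue", 200_000, 1.0),  # only real multi-speaker dialogue
]

LATENT_RATE_HZ = 25  # VAE latent frame rate mentioned in the summary


def stage_for_step(step: int) -> tuple[str, float]:
    """Return (stage name, probability of drawing a dialogue sample)."""
    name, ratio = STAGES[0][0], STAGES[0][2]
    for stage_name, start, dialogue_ratio in STAGES:
        if step >= start:
            # Later stages override earlier ones once their start is reached.
            name, ratio = stage_name, dialogue_ratio
    return name, ratio


def sample_source(step: int, rng: random.Random) -> str:
    """Pick which corpus the next training example is drawn from."""
    _, dialogue_ratio = stage_for_step(step)
    # rng.random() is in [0, 1), so ratio 0.0 never picks dialogue
    # and ratio 1.0 always does.
    return "dialogue" if rng.random() < dialogue_ratio else "monologue"


def latent_frames(duration_s: float) -> int:
    """Latent frames needed to cover a clip at 25 Hz (rounding up is assumed)."""
    return math.ceil(duration_s * LATENT_RATE_HZ)
```

Under these assumptions, a 60-second dialogue corresponds to `latent_frames(60)`, that is 1500 latent frames. That gives a rough sense of the sequence length the backbone has to keep coherent across turns.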
Who it's for and trade-offs
Great fit if you are building dialogue-aware TTS (virtual agents, audiobooks with multi-voice narration, social robots) and need zero-shot voice cloning that maintains turn-to-turn coherence. Look elsewhere if strict word-for-word content accuracy is the top priority (the authors report content accuracy as the main limitation), or if you require a tiny on-device model: SwanVoice targets expressiveness and conversational fidelity, which can imply larger models and more complex training and fine-tuning pipelines. (arxiv.org)
Where it sits in the ecosystem
SwanVoice fills the gap between single-turn zero-shot TTS (high per-turn quality but poor cross-turn consistency) and dedicated dialogue TTS systems (which often sacrifice monologue quality). By combining corpus construction (SwanData-Speech), model design, and reward-aware fine-tuning, it aims for a balanced solution for both monologue and dialogue scenarios. Demos and project pages are hosted by the SwanAIGC team. (swanaigc.github.io)
