MOSS-TTS-v1.5 focuses on practical, controllable multilingual speech synthesis rather than on raw MOS (mean opinion score) gains alone. The release targets two common pain points in modern TTS: stable zero-shot voice cloning across varied reference lengths, and predictable prosody driven by punctuation and explicit pause markers. These choices make the model easier to integrate into pipelines that need reproducible voice identity and fine-grained timing control.
Key Capabilities
- Multilingual synthesis with explicit language tags: specifying the language in the input improves fidelity for many languages other than Chinese and English, and reduces code-switching artifacts in mixed-language inputs.
- More stable zero-shot voice cloning: reduced variance across repeated generations and improved speaker similarity, so cloned voices remain consistent in batch production or A/B testing.
- Robust long-reference and short-text cloning: handles scenarios where the available reference audio is much longer than the target text, improving real-world usability for voice transfer workflows.
- Fine-grained prosody & duration control: token-level duration control plus inline pause markers (e.g. "[pause 3.2s]") let you shape timing and silences without editing audio post hoc.
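The inline pause markers lend themselves to a quick pre-flight check before text is sent to the model. The sketch below is not part of MOSS-TTS: it assumes the marker grammar is exactly what the single example above shows (`[pause <seconds>s]`), and the names `PAUSE_RE`, `split_pauses`, and `total_pause_seconds` are hypothetical helpers. It only inspects the input string; the model itself still consumes the original text with its markers.

```python
import re
from typing import List, Tuple, Union

# Assumed grammar, inferred from the one documented example "[pause 3.2s]".
# The real model may accept other forms; adjust the pattern if so.
PAUSE_RE = re.compile(r"\[pause\s+(\d+(?:\.\d+)?)s\]")

Segment = Tuple[str, Union[str, float]]


def split_pauses(text: str) -> List[Segment]:
    """Split text into ("text", str) and ("pause", seconds) segments."""
    segments: List[Segment] = []
    pos = 0
    for match in PAUSE_RE.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        segments.append(("pause", float(match.group(1))))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


def total_pause_seconds(text: str) -> float:
    """Sum the explicit silence requested by all pause markers."""
    return sum(value for kind, value in split_pauses(text) if kind == "pause")


if __name__ == "__main__":
    line = "Welcome back. [pause 3.2s] Let's begin."
    print(split_pauses(line))
    print(total_pause_seconds(line))
```

A check like this can help estimate how much silence a script adds, or flag malformed markers (which simply remain inside a text segment) before a long batch run.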
Who It's For and Trade-offs
MOSS-TTS-v1.5 is a good fit if you need reproducible multilingual TTS with voice cloning: teams building dubbing, narration, or multilingual assistants will benefit from the language-tag workflow and the pause and duration controls. It is also useful for researchers comparing cloning stability or prosody-control techniques.

Look elsewhere if you need a tiny on-device TTS, since this model expects a heavier Transformer backend and a recommended PyTorch version, or if you need an end-to-end hosted API. This distribution is a model checkpoint with code, and it assumes familiarity with Transformers-based inference and GPU optimizations (FlashAttention is optional). Note also that the project pins specific torch/torchaudio builds, which can constrain deployment choices.
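Because pinned builds can constrain deployment, it may help to record what is installed before comparing against the project's requirements. The following is a minimal sketch using only the Python standard library; `installed_versions` is a hypothetical helper, and the distribution names in the usage section (in particular `flash-attn`) are assumptions, so check them against the project's own requirements file.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Assumed distribution names; the project's requirements are authoritative.
    wanted = ["torch", "torchaudio", "transformers", "flash-attn"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

A missing `flash-attn` entry is not an error here, since the text above describes FlashAttention as optional.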
