
Pocket TTS

Generates low-latency, streaming text-to-speech entirely on CPUs (no GPU or cloud API required), using a ~100M-parameter model with voice cloning and multilingual support. Optimized for low resource use (2 CPU cores, ~200 ms to first audio chunk), which suits local, privacy-sensitive, or embedded TTS.

Introduction

Most high-quality TTS systems rely on GPUs or hosted APIs. Pocket TTS flips that expectation by delivering near-real-time TTS with voice cloning that runs efficiently on commodity CPUs. The practical upshot is that you get streaming audio and fast first-chunk latency without cloud access or specialized hardware, which widens where TTS can be deployed.

What Sets It Apart
  • CPU-first design: a compact ~100M-parameter model optimized to run on just 2 CPU cores, avoiding the need for GPU infrastructure. This enables on-device, offline deployments and lower operational cost.
  • Low-latency streaming: reported ~200 ms to the first audio chunk and overall throughput ~6x real-time on a MacBook Air M4, making it usable for interactive applications.
  • Voice cloning & fast voice loading: supports cloning from short audio files and exporting/importing voice states (safetensors) for fast reuse.
  • Multi-language support and extensibility: quality English voices plus several non-English languages (Italian, Spanish, German, French, Portuguese); larger variants with more layers are available for higher-quality non-English output.
  • Lightweight UX: provides a Python API, CLI, local HTTP server mode, and browser demos; multiple community ports (ONNX, Rust/Candle, WebAssembly) broaden deployment options.
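
The latency figures above can be turned into a rough budget. The following is a minimal sketch, not part of Pocket TTS: the function names are hypothetical, and the only inputs taken from the listing are the reported ~6x real-time throughput and ~200 ms first-chunk latency. It assumes throughput stays constant over the whole utterance, which real hardware may not honor.

```python
# Back-of-the-envelope latency budget for a streaming TTS engine.
# All names here are illustrative; nothing below calls Pocket TTS itself.


def synthesis_time_s(audio_s: float, speed_factor: float) -> float:
    """Wall-clock seconds needed to synthesize `audio_s` seconds of audio
    when the engine runs at `speed_factor` times real time."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_s / speed_factor


def keeps_up_with_playback(speed_factor: float) -> bool:
    """Once playback starts, streaming avoids stalls only if audio is
    produced at least as fast as it is played back."""
    return speed_factor >= 1.0


def idle_fraction(speed_factor: float) -> float:
    """Share of wall-clock time the synthesizer is not needed while
    playback of its own output is running (0.0 means no headroom)."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return max(0.0, 1.0 - 1.0 / speed_factor)


# Figures reported in the listing (MacBook Air M4); treat as approximate.
REPORTED_SPEED_FACTOR = 6.0
REPORTED_FIRST_CHUNK_S = 0.2

clip_s = 30.0
total_s = synthesis_time_s(clip_s, REPORTED_SPEED_FACTOR)
print(f"{clip_s:.0f} s of audio needs about {total_s:.1f} s of compute")
print(f"playback can begin after about {REPORTED_FIRST_CHUNK_S:.2f} s")
print(f"keeps up with playback: {keeps_up_with_playback(REPORTED_SPEED_FACTOR)}")
print(f"idle share during playback: {idle_fraction(REPORTED_SPEED_FACTOR):.0%}")
```

For an interactive application, the first-chunk latency is what the user perceives as delay, while the speed factor determines whether playback can continue without gaps and how much CPU headroom remains for other work.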

Who It's For and Tradeoffs

Great fit if you need local or privacy-sensitive TTS on low-resource machines (edge devices, laptops, small servers), want quick prototyping with a Python/CLI workflow, or need low-cost streaming TTS without cloud dependencies.
Look elsewhere if you require the highest-fidelity commercial voices with a large voice catalog, tighter real-time guarantees at scale (GPU farms + batching), or advanced features like int8-quantized models (currently unsupported). Note: developers observed little GPU speedup for the default configuration, due to the batch size of 1 and the model design.

Where It Fits

Pocket TTS sits between tiny rule-based TTS and large, cloud-hosted neural TTS: it trades top-end naturalness for portability and low-latency CPU execution. Use it for offline assistants, accessibility tools, local demos, embedded UI, or as a fallback TTS in privacy-first products.

Implementation & Ecosystem Notes

The project offers a Python library (TTSModel API), CLI commands (generate, serve, export-voice), and community exports (ONNX, WebAssembly, Candle). Official resources include a web demo, Hugging Face model cards and a tech report/paper for deeper details. The repository also lists prohibited uses and licensing info for voice assets.

Information

  • Website: github.com
  • Authors: Manu Orsini, Simon Rouard, Gabriel De Marmiesse, Václav Volhejn, Neil Zeghidour, Alexandre Défossez, Kyutai Labs
  • Published date: 2026/01/07

Categories