Generates low-latency, streaming text-to-speech entirely on CPU (no GPU or cloud API required), using a ~100M-parameter model with voice cloning and multilingual support. Optimized for low resource use (2 CPU cores, ~200 ms to first audio chunk), making it suited to local, privacy-sensitive, or embedded TTS.
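
The time-to-first-chunk figure is the number that matters most for streaming use, so it helps to check it on your own hardware. The sketch below is one way to time any chunk-yielding stream. It does not call the real engine: `fake_tts_stream` is a hypothetical stand-in that yields silent PCM bytes, and its chunk size and delay are arbitrary assumptions. To measure the actual model, you would swap in whatever iterator of audio chunks it exposes.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_tts_stream(text: str, chunk_bytes: int = 3200,
                    delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS engine.

    Yields one chunk of silent PCM bytes per word after a short sleep.
    Replace this with the real engine's chunk iterator.
    """
    for _ in text.split():
        time.sleep(delay_s)  # pretend to synthesize audio
        yield b"\x00" * chunk_bytes


def measure_stream(stream: Iterable[bytes]) -> Tuple[float, int, int]:
    """Consume a stream and return (first_chunk_latency_s, n_chunks, total_bytes)."""
    start = time.perf_counter()
    first_latency = None
    n_chunks = 0
    total_bytes = 0
    for chunk in stream:
        if first_latency is None:
            # Time from starting to iterate until the first chunk arrives.
            first_latency = time.perf_counter() - start
        n_chunks += 1
        total_bytes += len(chunk)
    if first_latency is None:
        raise ValueError("stream produced no audio chunks")
    return first_latency, n_chunks, total_bytes


if __name__ == "__main__":
    latency, n_chunks, total_bytes = measure_stream(
        fake_tts_stream("hello local speech synthesis")
    )
    print(f"first chunk after {latency * 1000:.1f} ms, "
          f"{n_chunks} chunks, {total_bytes} bytes")
```

Because generators are lazy, the timer starts before any synthesis work runs, so the measured latency covers everything up to the first yielded chunk. It does not include model load time, which would need to be timed separately.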