MLX-Audio
MLX-Audio is an audio-focused machine learning library designed to run efficiently on Apple Silicon using the MLX framework. It bundles end-to-end audio capabilities, covering Text-to-Speech (TTS), Speech-to-Text (STT/ASR), and Speech-to-Speech (STS), and provides tooling for inference, model conversion/quantization, a command-line interface, a Python API, and an OpenAI-compatible REST API plus a web UI for interactive use.
Key characteristics
- Apple Silicon optimized: inference paths and implementations are tuned for M-series chips to deliver fast, low-latency audio generation and transcription on macOS/iOS devices.
- Multi-capability: supports TTS, STT, and STS across multiple model families (Kokoro, Qwen3-TTS, Whisper variants, Parakeet, etc.).
- Voice customization: includes voice presets, voice cloning via reference audio, and expressive controls (speed, emotion/voice design where supported).
- Quantization & conversion: built-in conversion/quantization tooling (3-, 4-, 6-, and 8-bit, plus dtype options) to reduce model size and speed up inference; a quantization sketch follows this list.
- Developer ergonomics: Python API, CLI, Swift package for mobile integration, and a web interface for demos and quick testing.
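As a rough illustration of what the quantization tooling does underneath, the sketch below applies MLX's generic nn.quantize to a loaded model. This is a minimal sketch at the plain-MLX level, not MLX-Audio's own conversion script, whose exact interface is not shown here:

import mlx.nn as nn
from mlx_audio.tts.utils import load_model

# Load a bf16 checkpoint, then quantize its Linear/Embedding layers in place.
# nn.quantize is the MLX primitive underneath; MLX-Audio's conversion scripts
# wrap this kind of call and also persist the quantized weights.
model = load_model("mlx-community/Kokoro-82M-bf16")
nn.quantize(model, group_size=64, bits=4)

The group_size/bits pair trades accuracy against size: smaller groups and more bits preserve quality, while larger groups and fewer bits shrink the model further.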
Supported workflows
- TTS: fast multilingual TTS (example model: Kokoro-82M-bf16) with many voice presets and language codes.
- STT/ASR: supports Whisper-large-v3 variants and other ASR models for multilingual transcription, including diarization and timestamps in some backends; a transcription sketch follows this list.
- STS / audio processing: source separation, enhancement and speech-to-speech pipelines using dedicated models (SAM-Audio, MossFormer2 SE, etc.).
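For the transcription path, a minimal sketch is below. The loader and result fields are assumptions modeled on the TTS API shown under "Example snippets", not confirmed names:

# Assumed API, mirroring the TTS side: mlx_audio.stt.utils.load_model and
# result.text are illustrative names rather than confirmed ones.
from mlx_audio.stt.utils import load_model

stt_model = load_model("mlx-community/whisper-large-v3-turbo")
result = stt_model.generate("meeting.wav")  # path to the audio to transcribe
print(result.text)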
Typical usage
- The command-line interface and Python API make it easy to generate speech, transcribe audio, convert and quantize models, run a local API server, and launch the web UI.
- The package exposes an OpenAI-compatible HTTP API for TTS and STT, so tools that already speak that format can integrate directly, as sketched below.
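Because the HTTP surface follows the OpenAI audio API, existing OpenAI clients can simply be re-pointed at a local MLX-Audio server. A minimal sketch with the official openai Python package, assuming the server runs on localhost:8000 and mounts the usual /v1 prefix (both are assumptions here):

from openai import OpenAI

# The api_key is a placeholder the client requires; a local server
# typically ignores it. Host, port, and the /v1 prefix are assumptions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.audio.speech.create(
    model="mlx-community/Kokoro-82M-bf16",
    voice="af_heart",
    input="Hello from an OpenAI-compatible endpoint!",
)
resp.write_to_file("speech.wav")  # binary audio returned by the TTS endpoint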
Example snippets
- CLI TTS generation:
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello, world!" --voice af_heart --speed 1.0
- Python model load & generate:
from mlx_audio.tts.utils import load_model

# Downloads the checkpoint from the Hugging Face Hub on first use
model = load_model("mlx-community/Kokoro-82M-bf16")
for res in model.generate("Welcome to MLX-Audio!", voice="af_heart"):
    audio = res.audio  # the synthesized waveform for this segment
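As the loop suggests, generate() yields results incrementally, one segment per iteration, so long inputs can be consumed as they are synthesized rather than in a single blocking call.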
- Run local server & web UI:
mlx_audio.server --host 0.0.0.0 --port 8000
# then in ui/: npm install && npm run dev
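Once the server is up, the STT side can be exercised the same way. A minimal sketch with requests, assuming an OpenAI-style /v1/audio/transcriptions route (the route and field names follow the OpenAI spec; their availability here is an assumption):

import requests

# Assumption: the server exposes the OpenAI-style transcription route.
with open("meeting.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        files={"file": ("meeting.wav", f, "audio/wav")},
        data={"model": "mlx-community/whisper-large-v3-turbo"},
        timeout=300,
    )
resp.raise_for_status()
print(resp.json()["text"])  # OpenAI-format responses carry the transcript here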
Target audience & use cases
MLX-Audio suits developers building on-device speech features for macOS/iOS, researchers prototyping TTS/ASR/STS on Apple Silicon, and teams deploying speech services that benefit from quantized, high-performance models behind an OpenAI-compatible API.
Requirements & ecosystem
- Python 3.10+
- Apple Silicon (M1/M2/M3/M4) for best performance
- MLX framework
- ffmpeg for MP3/FLAC encoding
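- Installable from PyPI (pip install mlx-audio)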
License & citation
The project is MIT-licensed and provides a suggested citation (author: Prince Canuma). The repository includes conversion/quantization scripts and links to pre-trained model repos used as backends.
