MLX-Audio

MLX-Audio is an audio processing library built on Apple's MLX framework and optimized for Apple Silicon. It provides fast TTS, STT, and speech-to-speech (STS) capabilities; multi-model and multilingual support; voice cloning and customization; quantization for efficient inference; and an OpenAI-compatible API with a web UI.

Introduction

MLX-Audio is an audio-focused machine learning library designed to run efficiently on Apple Silicon using the MLX framework. It bundles end-to-end audio capabilities — Text-to-Speech (TTS), Speech-to-Text (STT/ASR) and Speech-to-Speech (STS) — and provides tooling for inference, model conversion/quantization, a command-line interface, a Python API, and an OpenAI-compatible REST API plus a web UI for interactive use.

Key characteristics
  • Apple Silicon optimized: inference paths and implementations are tuned for M-series chips to deliver fast, low-latency audio generation and transcription on macOS/iOS devices.
  • Multi-capability: supports TTS, STT, and STS across multiple model families (Kokoro, Qwen3-TTS, Whisper variants, Parakeet, etc.).
  • Voice customization: includes voice presets, voice cloning via reference audio, and expressive controls (speed, emotion/voice design where supported).
  • Quantization & conversion: built-in conversion/quantization tooling (3/4/6/8-bit and dtype options) to reduce model size and improve runtime performance (a sketch follows this list).
  • Developer ergonomics: Python API, CLI, Swift package for mobile integration, and a web interface for demos and quick testing.
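To illustrate the conversion/quantization tooling, here is a minimal sketch following the pattern shown in the repository README; the helper name and signature (quantize_model in mlx_audio.tts.utils and its return values) are assumptions to verify against the installed version:
import json
import os
import mlx.core as mx
from mlx_audio.tts.utils import load_model, quantize_model  # quantize_model signature assumed

# Load the full-precision model from the Hugging Face Hub
model = load_model("prince-canuma/Kokoro-82M")
config = model.config

# Quantize the weights to 8-bit with a group size of 64 (assumed signature)
group_size, bits = 64, 8
weights, config = quantize_model(model, config, group_size, bits)

# Persist the updated config and quantized weights in MLX safetensors format
os.makedirs("8bit", exist_ok=True)
with open("8bit/config.json", "w") as f:
    json.dump(config, f)
mx.save_safetensors("8bit/model.safetensors", weights, metadata={"format": "mlx"})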
Supported workflows
  • TTS: fast multilingual TTS (example model: Kokoro-82M-bf16) with many voice presets and language codes.
  • STT/ASR: supports Whisper-large-v3 variants and other ASR models for multilingual transcription, including diarization and timestamps in some backends (a Python sketch follows this list).
  • STS / audio processing: source separation, enhancement and speech-to-speech pipelines using dedicated models (SAM-Audio, MossFormer2 SE, etc.).
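As a sketch of the STT workflow, the snippet below assumes the STT side mirrors the TTS utilities (an mlx_audio.stt.utils.load_model helper and a generate() call that returns an object carrying the transcript); treat these names as assumptions and check the repository for the exact API.
# Sketch only: the module path, load_model helper, and the result's .text
# attribute are assumptions modeled on the TTS API; the model repo is illustrative.
from mlx_audio.stt.utils import load_model

model = load_model("mlx-community/whisper-large-v3-turbo")
result = model.generate("meeting.wav")  # path to a local audio file
print(result.text)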
Typical usage
  • The command-line interface and Python API make it easy to generate speech, transcribe audio, convert and quantize models, run a local API server, and start the web UI.
  • The package exposes an OpenAI-compatible HTTP API for TTS and STT endpoints, so it integrates with existing tools that expect that format (see the client sketch at the end of the example snippets below).
Example snippets
  • CLI TTS generation:
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello, world!" --voice af_heart --speed 1.0
  • Python model load & generate:
from mlx_audio.tts.utils import load_model

# Download (on first use) and load the model from the Hugging Face Hub
model = load_model("mlx-community/Kokoro-82M-bf16")

# generate() yields results segment by segment; each carries its audio samples
for res in model.generate("Welcome to MLX-Audio!", voice="af_heart"):
    audio = res.audio
  • Run local server & web UI:
mlx_audio.server --host 0.0.0.0 --port 8000
# then in ui/: npm install && npm run dev
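  • Query the running server with the standard OpenAI Python client (a sketch: the /v1 base path, the response_format handling, and the model/voice names are assumptions based on the OpenAI-compatible surface described above):
from openai import OpenAI

# Point the stock OpenAI client at the local MLX-Audio server (base URL assumed)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Stream synthesized speech to a file via the OpenAI-compatible TTS endpoint
with client.audio.speech.with_streaming_response.create(
    model="mlx-community/Kokoro-82M-bf16",
    voice="af_heart",
    input="Hello from the local server!",
    response_format="wav",
) as response:
    response.stream_to_file("hello.wav")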
Target audience & use cases

MLX-Audio is suitable for developers building on-device speech features for macOS/iOS, researchers/prototypers experimenting with TTS/ASR/STS on Apple Silicon, and teams deploying speech services that benefit from quantized, high-performance models and an OpenAI-compatible API surface.

Requirements & ecosystem
  • Python 3.10+
  • Apple Silicon (M1/M2/M3/M4) for best performance
  • MLX framework
  • ffmpeg for MP3/FLAC encoding
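With these prerequisites in place, installation is typically a single command (assuming the package is published on PyPI as mlx-audio, as the repository indicates):
pip install mlx-audio
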
License & citation

The project is MIT-licensed and provides a suggested citation (author: Prince Canuma). The repository includes conversion/quantization scripts and links to pre-trained model repos used as backends.

Information

  • Website: github.com
  • Authors: Prince Canuma (Blaizzy)
  • Published date: 2024/11/27
