MLX-Audio

MLX-Audio is an audio processing library built on Apple's MLX framework and optimized for Apple Silicon. It provides fast TTS, STT, and speech-to-speech (STS) capabilities; multi-model and multilingual support; voice cloning and customization; quantization for efficient inference; and an OpenAI-compatible API with a web UI.

Introduction

MLX-Audio is an audio-focused machine learning library designed to run efficiently on Apple Silicon using the MLX framework. It bundles end-to-end audio capabilities — Text-to-Speech (TTS), Speech-to-Text (STT/ASR) and Speech-to-Speech (STS) — and provides tooling for inference, model conversion/quantization, a command-line interface, a Python API, and an OpenAI-compatible REST API plus a web UI for interactive use.

Key characteristics
  • Apple Silicon optimized: inference paths and implementations are tuned for M-series chips to deliver fast, low-latency audio generation and transcription on macOS/iOS devices.
  • Multi-capability: supports TTS, STT, and STS across multiple model families (Kokoro, Qwen3-TTS, Whisper variants, Parakeet, etc.).
  • Voice customization: includes voice presets, voice cloning via reference audio, and expressive controls (speed, emotion/voice design where supported).
  • Quantization & conversion: built-in conversion/quantization tooling (3/4/6/8-bit and dtype options) to reduce model size and improve runtime performance (a sketch follows this list).
  • Developer ergonomics: Python API, CLI, Swift package for mobile integration, and a web interface for demos and quick testing.
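To illustrate the conversion/quantization tooling, here is a minimal sketch following the pattern shown in the repository README; the helper name and signature (quantize_model in mlx_audio.tts.utils and its return values) are assumptions to verify against the installed version:
import json
import os
import mlx.core as mx
from mlx_audio.tts.utils import load_model, quantize_model  # quantize_model signature assumed

# Load the full-precision model from the Hugging Face Hub
model = load_model("prince-canuma/Kokoro-82M")
config = model.config

# Quantize the weights to 8-bit with a group size of 64 (assumed signature)
group_size, bits = 64, 8
weights, config = quantize_model(model, config, group_size, bits)

# Persist the updated config and quantized weights in MLX safetensors format
os.makedirs("8bit", exist_ok=True)
with open("8bit/config.json", "w") as f:
    json.dump(config, f)
mx.save_safetensors("8bit/model.safetensors", weights, metadata={"format": "mlx"})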
Supported workflows
  • TTS: fast multilingual TTS (example model: Kokoro-82M-bf16) with many voice presets and language codes.
  • STT/ASR: supports Whisper-large-v3 variants and other ASR models for multilingual transcription, including diarization and timestamps in some backends (a Python sketch follows this list).
  • STS / audio processing: source separation, enhancement and speech-to-speech pipelines using dedicated models (SAM-Audio, MossFormer2 SE, etc.).
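As a sketch of the STT workflow, the snippet below assumes the STT side mirrors the TTS utilities (an mlx_audio.stt.utils.load_model helper and a generate() call that returns an object carrying the transcript); treat these names as assumptions and check the repository for the exact API.
# Sketch only: the module path, load_model helper, and the result's .text
# attribute are assumptions modeled on the TTS API; the model repo is illustrative.
from mlx_audio.stt.utils import load_model

model = load_model("mlx-community/whisper-large-v3-turbo")
result = model.generate("meeting.wav")  # path to a local audio file
print(result.text)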
Typical usage
  • The command-line interface and Python API make it easy to generate speech, transcribe audio, convert and quantize models, run a local API server, and start the web UI.
  • The package exposes an OpenAI-compatible HTTP API for TTS and STT endpoints, so it integrates with existing tools that expect that format (see the client sketch at the end of the example snippets below).
Example snippets
  • CLI TTS generation:
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello, world!" --voice af_heart --speed 1.0
  • Python model load & generate:
from mlx_audio.tts.utils import load_model

# Download (on first use) and load the model from the Hugging Face Hub
model = load_model("mlx-community/Kokoro-82M-bf16")

# generate() yields results segment by segment; each carries its audio samples
for res in model.generate("Welcome to MLX-Audio!", voice="af_heart"):
    audio = res.audio
  • Run local server & web UI:
mlx_audio.server --host 0.0.0.0 --port 8000
# then in ui/: npm install && npm run dev
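  • Query the running server with the standard OpenAI Python client (a sketch: the /v1 base path, the response_format handling, and the model/voice names are assumptions based on the OpenAI-compatible surface described above):
from openai import OpenAI

# Point the stock OpenAI client at the local MLX-Audio server (base URL assumed)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Stream synthesized speech to a file via the OpenAI-compatible TTS endpoint
with client.audio.speech.with_streaming_response.create(
    model="mlx-community/Kokoro-82M-bf16",
    voice="af_heart",
    input="Hello from the local server!",
    response_format="wav",
) as response:
    response.stream_to_file("hello.wav")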
Target audience & use cases

MLX-Audio is suitable for developers building on-device speech features for macOS/iOS, researchers/prototypers experimenting with TTS/ASR/STS on Apple Silicon, and teams deploying speech services that benefit from quantized, high-performance models and an OpenAI-compatible API surface.

Requirements & ecosystem
  • Python 3.10+
  • Apple Silicon (M1/M2/M3/M4) for best performance
  • MLX framework
  • ffmpeg for MP3/FLAC encoding
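With these prerequisites in place, installation is typically a single command (assuming the package is published on PyPI as mlx-audio, as the repository indicates):
pip install mlx-audio
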
License & citation

The project is MIT-licensed and provides a suggested citation (author: Prince Canuma). The repository includes conversion/quantization scripts and links to pre-trained model repos used as backends.

Information

  • Website: github.com
  • Authors: Prince Canuma (Blaizzy)
  • Published date: 2024/11/27
