AI Audio · 2023

fish-speech

Generates expressive multilingual speech from text, with sub-word control over prosody and emotion via inline tags like [whisper] or [angry]. Handles multi-speaker, multi-turn dialogue; the weights ship under a research-only license.

Introduction

Most open text-to-speech projects treat emotion as an afterthought: a neutral voice with, at most, a speed slider. Fish Speech inverts that priority. Prosody and emotion are steerable at the sub-word level through inline natural-language tags, so a single sentence can slide from [whisper] to [excited] without re-recording. Under the hood sits Fish Audio S2 Pro, a Dual-Autoregressive model trained on 10M+ hours of audio across 80+ languages with reinforcement-learning alignment.
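
To make the inline-tag idea concrete, here is a minimal Python sketch of how tagged text could be split into styled spans before synthesis. It is an illustration only, not Fish Speech's own parser or API: the function name `split_by_tags`, the `neutral` default, the lowercase-word tag pattern, and the rule that a tag applies until the next tag are all assumptions made for this example.

```python
import re
from typing import List, Tuple

# Assumed tag syntax for illustration: one lowercase word in square brackets.
TAG_PATTERN = re.compile(r"\[([a-z]+)\]")


def split_by_tags(text: str, default: str = "neutral") -> List[Tuple[str, str]]:
    """Split tagged text into (style, span) pairs.

    Assumption: each tag applies to the text that follows it,
    up to the next tag or the end of the string.
    """
    segments: List[Tuple[str, str]] = []
    style = default
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        # Text before this tag still belongs to the previous style.
        span = text[pos:match.start()].strip()
        if span:
            segments.append((style, span))
        style = match.group(1)
        pos = match.end()
    # Whatever follows the last tag keeps that tag's style.
    tail = text[pos:].strip()
    if tail:
        segments.append((style, tail))
    return segments


if __name__ == "__main__":
    line = "[whisper] Don't tell anyone, [excited] but we won!"
    for style, span in split_by_tags(line):
        print(f"{style}: {span}")
```

A real model consumes the tags directly as part of its input; the split is shown here only to illustrate that one sentence can carry more than one delivery style.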

What Sets It Apart

Sub-word emotional control: tags like [angry] or [whisper] apply mid-sentence, giving direction-level nuance instead of one flat delivery per clip.
Native multi-speaker, multi-turn dialogue: generate whole conversations or character audio in a single pass rather than stitching separate renders.
Serving-ready by design: CLI, WebUI, and server inference, plus SGLang-Omni and vLLM-Omni recipes, mean it drops into production stacks, not just notebooks.
Coverage of 80+ languages: the same model handles cross-lingual cloning, so you are not swapping engines per locale.

Who It's For

A great fit if you need expressive, controllable multilingual TTS or voice cloning and can operate within a research-oriented license. Look elsewhere if you need clean commercial redistribution rights out of the box. The Fish Audio Research License restricts how weights and outputs may be used, so read it before shipping anything customer-facing.

Information

Website: github.com
Organizations: Fish Audio
Authors: Fish Audio (fishaudio)
Published date: 2023/10/10

More Items

MCP Server · 2025

Vexa

Vexa-ai

Runs a self-hosted meeting bot and transcription API that joins Google Meet, Teams and Zoom and streams speaker-attributed transcripts in real time. Compiles meetings into a git-backed Markdown workspace and runs sandboxed agents on your infrastructure; Apache-2.0 and air-gap capable.

stt mcp-server ai-agent ai-api chatbot+8

AI Audio · 2026

Gepard

Nineninesix, Inc., NVIDIA +1

Generates streaming, low‑latency neural speech for real‑time dialogue by autoregressively producing audio frames as text arrives; joint text–speech training preserves natural prosody. Optimized for vLLM streaming (~50 ms first chunk), supports short‑clip voice cloning and four languages.

tts vllm qwen transformers huggingface+5

AI Audio · 2026

CohereLabs/cohere-transcribe-arabic-07-2026

CohereLabs

Transcribes Arabic speech to text using a CohereLabs-trained ASR model compatible with the Hugging Face Transformers pipeline. Provides safetensors weights, endpoint compatibility and a DOI-tagged release; suitable for Arabic transcription workflows but may require adaptation for diverse dialects or noisy audio.

ASR speech audio transformers huggingface+4