Tag

Explore by tags

AI Audio, 2019

NVIDIA NeMo Speech

NVIDIA

Provides a toolkit and codebase for building, training, and deploying speech and multimodal models — Automatic Speech Recognition, Text-to-Speech, and speech-aware LLMs — with modular neural components and pre-trained checkpoints for PyTorch. Supports streaming/low-latency inference, multi-language models, and optional compiled kernels for acceleration.

nvidia speech ASR tts pytorch +6

AI Audio, 2022

Whisper

OpenAI

Multilingual sequence-to-sequence speech model and toolkit for speech recognition, speech-to-text translation, and language identification. Offers several model sizes (tiny → large/turbo) for different speed/accuracy trade-offs and ships with a CLI and Python API for offline transcription workflows.

openai speech ASR multilingual pytorch +4

AI Audio, 2023

pyannote/speaker-diarization-3.1

pyannote

Performs speaker diarization (who spoke when) with pyannote-audio: combines voice-activity detection, speaker-change and overlapped-speech detection to produce time-stamped speaker segments; compatible with Hugging Face Endpoints and ASR pipelines.

huggingface audio speech ASR pytorch +2
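
One common way to combine diarization output with an ASR transcript is to label each transcribed phrase with the speaker whose turns overlap it the most. The sketch below illustrates that idea in plain Python and does not call pyannote-audio itself. The `Turn` and `Phrase` classes, the `assign_speakers` helper, the `SPEAKER_00`-style labels, and the sample timestamps are all hypothetical names and values chosen for illustration, not part of any library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """A diarization segment: one speaker talking from start to end (seconds)."""
    start: float
    end: float
    speaker: str


@dataclass(frozen=True)
class Phrase:
    """A transcribed chunk of speech from an ASR system, with timestamps."""
    start: float
    end: float
    text: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(phrases, turns, unknown="UNKNOWN"):
    """Label each phrase with the speaker whose turns overlap it the most."""
    labelled = []
    for p in phrases:
        totals = {}  # speaker -> total overlapping seconds with this phrase
        for t in turns:
            o = overlap(p.start, p.end, t.start, t.end)
            if o > 0:
                totals[t.speaker] = totals.get(t.speaker, 0.0) + o
        # Fall back to a placeholder when no speaker turn covers the phrase.
        speaker = max(totals, key=totals.get) if totals else unknown
        labelled.append((speaker, p.text))
    return labelled


# Hypothetical sample data: two speaker turns that overlap by 0.5 s.
turns = [
    Turn(0.0, 4.0, "SPEAKER_00"),
    Turn(3.5, 9.0, "SPEAKER_01"),
]
phrases = [
    Phrase(0.2, 3.8, "Shall we start?"),
    Phrase(4.1, 8.7, "Yes, go ahead."),
    Phrase(10.0, 11.0, "(background noise)"),
]
result = assign_speakers(phrases, turns)
for speaker, text in result:
    print(f"{speaker}: {text}")
```

Maximum overlap is a simple heuristic: a phrase that spans a speaker change is assigned wholly to the dominant speaker, so a real pipeline may prefer word-level timestamps for finer attribution.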

AI Audio, 2024

Meetily

sujithatzackriya, safvanatzack +6 | Zackriya-Solutions, Meetily (meetily.ai)

Captures, transcribes, and summarizes meetings entirely on the user's machine with real-time local transcription and speaker diarization. Privacy-first design keeps audio, transcripts, and models local; supports Ollama, Claude, Groq, OpenRouter or custom OpenAI-compatible endpoints.

rust ollama stt speech privacy +6

AI Audio, 2025

OpenSuperWhisper

Starmel

Provides real-time, local audio recording and transcription on macOS using Whisper and Parakeet engines, with global hotkeys and hold-to-record behavior. Includes model download, microphone selection, drag-and-drop file transcription, multilingual auto-detection and Asian-language autocorrect; Apple Silicon only.

stt speech audio voice github +3

MCP Server, 2025

Vexa

Vexa-ai

Runs a self-hosted meeting bot and transcription API that joins Google Meet, Teams, and Zoom calls and streams speaker-attributed transcripts in real time. Compiles meetings into a git-backed Markdown workspace and runs sandboxed agents on your infrastructure; Apache-2.0 licensed and air-gap capable.

stt mcp-server ai-agent ai-api chatbot +8

AI Client, 2025

Read Frog

mengxi-ream

Turns web reading into an in-context language-learning experience by injecting context-aware translations, explanations, subtitle translation, and TTS directly into the browser. Supports selection translation, batch requests and configurable AI providers to balance cost and quality.

translation multilingual ai-client ai-tools speech +2

AI Dataset, 2025

WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus With Rich Annotation For Dialectal Speech Processing

Yuhang Dai, Ziyu Zhang +14 | Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University, Beijing AISHELL Technology Co., Ltd. +3

Provides a 10,000-hour Sichuanese (Chuan-Yu) speech corpus with rich annotations (timestamps, speaker age/gender/emotion, SNR, DNSMOS) and unified metadata for ASR and TTS research; includes metadata.jsonl, evaluation benchmarks, and an LLM-assisted transcription pipeline.

ASR stt speech audio tts +4

AI Audio, 2025

Dograh AI

Dograh (Zansat Technologies Private Limited)

Lets teams build and self-host production voice agents with a drag-and-drop workflow builder, real-time telephony integration, and pluggable LLM/STT/TTS backends. Docker-first, with an optional managed cloud offering for teams that want faster onboarding.

speech audio llm ai-agent docker +5

AI Audio, 2025

Supertonic

Supertone Inc.

Delivers multilingual, on-device text-to-speech via ONNX Runtime with prebuilt ONNX assets and cross-platform SDKs (Python, Node, mobile); targets low-latency, privacy-preserving TTS with ready demos and 31-language support in v3.

audio speech multilingual huggingface python +8

AI Audio, 2026

Pocket TTS

Manu Orsini, Simon Rouard +5

Generates low-latency, streaming text-to-speech entirely on CPUs (no GPU or cloud API required), using a ~100M-parameter model with voice cloning and multilingual support. Optimized for low resource use (2 CPU cores, ~200 ms to first audio chunk), which suits local, privacy-sensitive, or embedded TTS.

pytorch python speech multilingual cli +4

AI Deploy, 2026

SGLang-Omni

sgl-project

Orchestrates low-latency, multi-stage pipelines for omni and multimodal models by running each stage with its own scheduler and using zero-copy shared memory for tensor transfer. Emphasizes per-stage bottleneck tuning and OpenAI-compatible streaming endpoints, suitable for TTS and multimodal serving.

ai-serving ai-inference audio tts multimodal +3