Tag

Explore by tags

Speech Technology Papers · 2015

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Dario Amodei, Rishita Anubhai +32 · Baidu Research

Bets that a single neural network, scaled with HPC techniques, can transcribe both English and Mandarin without hand-built pipelines, reaching human-competitive accuracy by training fast enough to iterate on architectures in days rather than weeks.

30u30 paper audio ASR

AI Train · 2017

fairseq

Facebook AI Research (FAIR) · Meta AI (formerly Facebook AI Research)

Sequence modeling toolkit for training custom models for translation, summarization, and language modeling. Reference implementation behind RoBERTa, BART, mBART, XLM-R, and wav2vec 2.0, with multi-GPU and mixed-precision training.

pytorch nlp translation ASR audio +5

AI Audio · 2019

NVIDIA NeMo

NVIDIA

Build, fine-tune, and deploy speech AI on NVIDIA GPUs: ASR, text-to-speech, and speech LLMs in one PyTorch stack. Ships pretrained Parakeet/Canary recognition and Magpie TTS checkpoints; broader LLM/multimodal training now lives in v2.7.0.

nvidia pytorch ASR audio huggingface +3

AI Audio · 2019

NVIDIA NeMo Speech

NVIDIA

Provides a toolkit and codebase for building, training, and deploying speech and multimodal models — Automatic Speech Recognition, Text-to-Speech, and speech-aware LLMs — with modular neural components and pre-trained checkpoints for PyTorch. Supports streaming/low-latency inference, multi-language models, and optional compiled kernels for acceleration.

nvidia speech ASR tts pytorch +6

AI Audio · 2022

Whisper

OpenAI

Multilingual sequence-to-sequence speech model and toolkit for speech recognition, speech-to-text translation, and language identification. Offers several model sizes (tiny → large/turbo) for different speed/accuracy trade-offs and ships with a CLI and Python API for offline transcription workflows.

openai speech ASR multilingual pytorch +4

AI Audio · 2022

FunASR

Alibaba DAMO Academy, Northwestern Polytechnical University (NWPU) +5 · Alibaba DAMO Academy, ModelScope

Bundles ASR, voice activity detection, punctuation, and speaker diarization into one pipeline, with pretrained models like Paraformer and SenseVoice. SenseVoice runs ~17x realtime on CPU; also ships streaming ASR and an OpenAI-compatible API.

ASR audio pytorch ai-library huggingface +4
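
To put the "~17x realtime" figure in concrete terms, here is a small arithmetic sketch. It assumes the multiple holds steadily over a whole recording, which real workloads may not match; `processing_seconds` is a hypothetical helper for illustration, not part of FunASR.

```python
def processing_seconds(audio_seconds: float, realtime_multiple: float) -> float:
    """Estimate wall-clock transcription time for a given speed multiple.

    A multiple of 17 means 17 seconds of audio are processed per second
    of compute (assumed constant here, which is a simplification).
    """
    if realtime_multiple <= 0:
        raise ValueError("realtime_multiple must be positive")
    return audio_seconds / realtime_multiple


one_hour = 60 * 60  # seconds of audio
estimate = processing_seconds(one_hour, 17.0)
print(f"{estimate / 60:.1f} minutes")  # 3.5 minutes
```

Under that assumption, an hour of audio would take roughly three and a half minutes of CPU time.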

AI Audio · 2023

faster-whisper

SYSTRAN

Reimplements OpenAI's Whisper speech-to-text on the CTranslate2 inference engine, running up to 4x faster at the same accuracy while using less memory. Adds a batched pipeline, 8-bit quantization, VAD filtering, and word-level timestamps.

ASR python gitHub audio ai-inference +1

AI Audio · 2023

RealtimeSTT

Kolja Beigel

Converts microphone or streamed audio to text with sub-second latency, pairing WebRTC/Silero voice-activity detection and wake-word activation with swappable local backends — faster-whisper by default, plus whisper.cpp, Moonshine, and sherpa-onnx.

ASR pytorch python github audio +4

AI Video · 2023

pyVideoTrans

jianchang512

Converts videos between languages by transcribing audio, translating subtitles, and producing AI dubbing. Supports local and online ASR/LLM/TTS providers, speaker diarization, voice cloning, and GUI/CLI workflows for batch or headless use.

video ASR translation ai-video audio +4

AI Audio · 2023

Insanely Fast Whisper

Vaibhavs10

Terminal CLI for on-device Whisper ASR using Hugging Face Transformers + Optimum, with optional Flash Attention 2, batching, and diarization support — focused on high-throughput transcription on NVIDIA GPUs and Apple Silicon (mps).

huggingface openai ASR audio ai-inference +2

AI Audio · 2023

pyannote/speaker-diarization-3.1

pyannote

Performs speaker diarization (who spoke when) with pyannote-audio: combines voice-activity, speaker-change, and overlapped-speech detection to produce time-stamped speaker segments; compatible with Hugging Face Endpoints and ASR pipelines.

huggingface audio speech ASR pytorch +2
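
One common way to combine time-stamped speaker segments with ASR output is to label each transcript segment with the speaker whose turn overlaps it most. The sketch below illustrates that idea in plain Python; the tuple layouts, the `assign_speakers` helper, and the sample data are assumptions made for illustration and are not pyannote's API.

```python
from typing import List, Optional, Tuple

Turn = Tuple[float, float, str]     # (start_s, end_s, speaker label), hypothetical layout
Segment = Tuple[float, float, str]  # (start_s, end_s, transcript text), hypothetical layout


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(
    segments: List[Segment], turns: List[Turn]
) -> List[Tuple[Optional[str], str]]:
    """Label each ASR segment with the speaker whose turn overlaps it most.

    Segments that overlap no turn at all are labeled None.
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker: Optional[str] = None
        best_overlap = 0.0
        for turn_start, turn_end, speaker in turns:
            shared = overlap(seg_start, seg_end, turn_start, turn_end)
            if shared > best_overlap:
                best_overlap, best_speaker = shared, speaker
        labeled.append((best_speaker, text))
    return labeled


# Made-up example data: two speaker turns and three transcript segments.
turns = [(0.0, 4.0, "SPEAKER_00"), (4.0, 9.0, "SPEAKER_01")]
segments = [
    (0.5, 3.5, "Hello there."),
    (3.8, 8.0, "Hi, how are you?"),
    (10.0, 11.0, "Goodbye."),
]

for speaker, text in assign_speakers(segments, turns):
    print(speaker, text)
```

Maximum overlap is a simple heuristic; it ignores overlapped speech, where a segment may reasonably belong to more than one speaker.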

AI Client · 2023

Open-LLM-VTuber

Hands-free voice-first companion with a Live2D avatar for real-time conversations with LLMs. Cross-platform web and desktop clients, runs locally or via cloud APIs, supports local ASR/TTS and modular customization for personas and models.

llm chatbot ai-client audio ASR +5
