AIAny - Speech Technology Papers

Bets that one neural network, scaled with HPC, can transcribe both English and Mandarin without hand-built pipelines, reaching human-competitive accuracy by training fast enough to iterate on architecture in days rather than weeks.

30u30 paper audio ASR

AI Dataset · 2026

Waxal NLP Datasets

Google Research, Makerere University +6

Provides open ASR and TTS speech data for 24 Sub‑Saharan African languages to train and evaluate speech models. Includes ~1,250 hours of transcribed ASR and ~235 hours of single‑speaker TTS with train/validation/test/unlabeled splits and mixed CC-BY licenses.

multilingual audio speech ASR tts +3

Speech Technology Papers · 2026

SwanVoice: Expressive Long-Form Zero-Shot Speech Synthesis for Both Monologue and Dialogue

Ruiqi Li, Yu Zhang +4

Zero-shot TTS for expressive long-form monologue and multi-speaker dialogue, designed to preserve acoustic consistency, conversational coherence, and affective continuity. Trained on SwanData-Speech, it uses a 25 Hz VAE, pause-aware text conditioning, and a flow-matching DiT with DiffusionNFT fine-tuning.

paper speech audio voice foundation-model

Speech Technology Papers · 2026

Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

Ke Lei, Yu Zhang +5

Generates synchronized, streaming spatial audio from panoramic video and text prompts using a causal autoregressive diffusion transformer. Combines Spatial Video-Audio Contrastive (SVAC) alignment and online direct preference optimization (ODPO) to improve spatial perception, plus an automated annotation pipeline and public demos.

paper audio speech multimodal transformers +3

Speech Technology Papers · 2026

MMAE: A Massive Multitask Audio Editing Benchmark

Ziyang Ma, Ruiqi Yan +36

Provides a comprehensive benchmark for instruction-based audio editing across seven audio modalities and eight operation types, with 2,000 high-fidelity samples and a rubric that decomposes tasks into 17,741 verifiable criteria for multi-dimensional evaluation.

audio multimodal paper speech ai-leaderboard

Computer Vision Papers · 2026

LightMem-Ego: Your AI Memory for Everyday Life

Yijun Chen, Boyi Xiao +11

Continuously records egocentric visual and audio streams into a lightweight streaming memory that organizes experiences into current, short-term, and long-term tiers and retrieves multimodal evidence to answer queries about past events. Built for on-device use (smartphones/AI glasses) with dynamic retrieval routing.

multimodal vision audio mobile code +1


Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
