LogoAIAny
Icon for item

Insanely Fast Whisper

Terminal CLI for on-device Whisper ASR using Hugging Face Transformers + Optimum, with optional Flash Attention 2, batching, and diarization support — focused on high-throughput transcription on NVIDIA GPUs and Apple Silicon (mps).

Introduction

Most Whisper integrations focus on correctness; this project focuses on throughput — letting you transcribe long audio collections from the terminal by combining model-level choices with engineering shortcuts.

What Sets It Apart
  • Engineering-first speed paths: combines fp16, large batch sizes, BetterTransformer and optional Flash Attention 2 so a properly-provisioned A100 can transcribe large checkpoints orders of magnitude faster than out-of-the-box Transformers. This means bulk workloads (podcasts, call centers) can be processed far quicker if you can provide the GPU resources and dependencies.

  • Flexible model support: ships with presets for openai/whisper-large-v3 and lighter distil-whisper variants, so you trade off accuracy vs throughput explicitly. That matters when you need near-real-time throughput for very long audio vs highest possible WER for short clips.

  • Practical CLI features, not a library: built as an opinionated CLI with flags for device selection (CUDA device id or mps), batching, timestamp mode, and optional pyannote-based speaker diarization — designed for batch jobs and simple automation rather than library embedding.

Who it's for & tradeoffs

Great fit if you have access to beefy GPUs (e.g., A100) or Apple Silicon and need to transcribe large volumes quickly, and you can manage native dependencies (Flash Attention builds, correct PyTorch CUDA wheels). Look elsewhere if you need a plugin-style SDK for embedding into a larger app, require guaranteed cross-platform binary installs for non-technical users, or cannot install compiled extensions — some optimizations rely on platform-specific builds and can be fragile (mps is more memory-hungry; Windows CUDA/Torch mismatch issues have been reported).

Where it fits

Use this when throughput per dollar (or per-GPU-hour) matters and you can accept operational overhead to install and tune low-level accelerators. For small-scale or serverless deployments, hosted ASR services or simpler wrappers around Faster Whisper may be more convenient.

Practical notes

Benchmarks included by the author show orders-of-magnitude speedups on an Nvidia A100 when Flash Attention 2 and batching are enabled, but those numbers depend heavily on hardware, PyTorch/Flash-Attn builds, and model choice. The project also exposes diarization options via pyannote, but diarization requires a Hugging Face token and separate models.

Information

  • Websitegithub.com
  • AuthorsVaibhavs10
  • Published date2023/10/10

Categories