FlashAttention

Fast, memory-efficient exact attention kernels that cut attention memory from quadratic to linear in sequence length and speed up training and inference on GPUs (A100/H100). Offers multi-backend support (CUDA/Triton/ROCm), a KV-cache for decoding, and successive FlashAttention versions tuned for different hardware.

Introduction

Most transformer training and inference bottlenecks come from attention's O(sequence^2) memory and IO inefficiencies. By rethinking how attention is computed and scheduled on the GPU (IO-aware kernels, fused forward/backward, and careful work partitioning), this project makes exact scaled-dot-product attention practical for much longer contexts and higher throughput without resorting to approximation.
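
The idea of computing exact attention without storing the full score matrix can be illustrated with a small sketch. The code below is not the project's kernel: it is a plain-Python illustration of block-wise attention with a running ("online") softmax, and the function names `naive_attention` and `tiled_attention` and the `block_size` parameter are hypothetical names chosen for this example. Real kernels also tile over queries and keep blocks in on-chip memory, which this sketch does not model.

```python
import math

def naive_attention(q, k, v):
    """Reference version: builds the full row of scores for every query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out

def tiled_attention(q, k, v, block_size=2):
    """Exact attention computed one key/value block at a time.

    Per query, only a running max, a running normalizer, and a running
    output vector are kept, so no full row of scores is ever stored.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        run_max = -math.inf   # largest score seen so far
        run_sum = 0.0         # softmax normalizer relative to run_max
        acc = [0.0] * dv      # unnormalized weighted sum of values
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(run_max, max(scores))
            # Rescale what was accumulated under the old max.
            correction = math.exp(run_max - new_max)
            run_sum *= correction
            acc = [a * correction for a in acc]
            for s, vj in zip(scores, v_blk):
                w = math.exp(s - new_max)
                run_sum += w
                for c in range(dv):
                    acc[c] += w * vj[c]
            run_max = new_max
        out.append([a / run_sum for a in acc])
    return out
```

Both functions should agree up to floating-point rounding for any `block_size`, which is the sense in which the block-wise computation is exact rather than approximate.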

What Sets It Apart
  • IO-aware fused kernels and backward algorithms: these reduce temporary memory and avoid large intermediate matrices, so you can train longer-context models or increase batch size without running out of memory.
  • Multi-generation support and hardware-specific optimizations (FlashAttention-2, FlashAttention-3 for Hopper/H100, FlashAttention-4 in CuTeDSL): each release exploits better parallelism or new GPU features to extract more throughput on A100/H100/B200-class devices.
  • Inference-focused features: paged KV-cache and flash_attn_with_kvcache for efficient incremental decoding, plus sliding-window/local attention and ALiBi support. Together these lower latency and memory use for autoregressive decoding at large sequence lengths.
  • Multi-backend portability (CUDA kernels, Triton/ROCm backends, and Hugging Face kernels integration): both NVIDIA and AMD environments can use the implementation, easing adoption across datacenter hardware.
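
To make the incremental-decoding and sliding-window points concrete, here is a minimal plain-Python sketch of what a KV-cache does during autoregressive decoding. It does not call flash_attn_with_kvcache and does not mirror its signature; `ToyKVCache`, `decode_step`, `attend`, and the `window` parameter are hypothetical names introduced only for this illustration.

```python
import math

def attend(query, keys, values):
    """Softmax attention of one query vector over lists of key/value vectors."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(a * b for a, b in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * val[c] for w, val in zip(weights, values)) / total
            for c in range(len(values[0]))]

class ToyKVCache:
    """Hypothetical illustration: keeps keys/values of already-processed tokens."""

    def __init__(self):
        self.keys = []
        self.values = []

    def decode_step(self, query, new_key, new_value, window=None):
        # Store the current token's key/value so later steps can reuse them
        # instead of recomputing them for the whole prefix.
        self.keys.append(new_key)
        self.values.append(new_value)
        keys, values = self.keys, self.values
        if window is not None:
            # Sliding-window (local) attention: only the most recent tokens.
            keys, values = keys[-window:], values[-window:]
        return attend(query, keys, values)
```

Each call to `decode_step` does work proportional to the number of cached tokens (or to `window`, if set) rather than reprocessing the full sequence, which is the property the inference features above rely on.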

Who It's For and Trade-offs

Great fit if you: need to train or serve transformer models with long context windows, want significant speed and memory gains on GPU clusters (A100/H100-level), or need a production-ready KV-cache for low-latency autoregressive decoding. It is also the de facto low-level attention primitive for many LLM stacks and is widely adopted in model and kernel libraries.

Look elsewhere if you: target very old or low-end GPUs (Turing-level support is partial), need a pure-CPU or Windows-first solution, or want a drop-in attention replacement with zero build/test overhead. Compiling high-performance kernels and matching CUDA/ROCm versions can be nontrivial. Also, some advanced features (e.g., FlashAttention-3/4) require recent GPUs and specific CUDA versions (for example, H100 with CUDA 12.3 or later, with CUDA 12.8 recommended), which may limit portability.

Where It Fits

FlashAttention takes a systems-first approach: instead of approximating attention, it optimizes exact attention to be IO- and memory-efficient. Practitioners building high-throughput training pipelines, inference services, or custom model kernels (e.g., for Hugging Face or internal stacks) will find it most valuable. For small-scale experiments on consumer GPUs or CPU-bound workloads, the integration and build costs may outweigh the runtime benefits.

Information

  • Website: github.com
  • Authors: Dao-AILab, Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
  • Published date: 2022/05/19

Categories