AI Infra (2022)

FlashAttention

Fused CUDA kernels that compute exact attention without ever writing the full N×N score matrix to GPU memory, cutting memory from quadratic to linear and speeding up training and inference on A100/H100. Ships FlashAttention-2/3 plus KV-cache decode paths.


Introduction

The bottleneck in attention is not the math but the memory traffic. A standard implementation writes the full N×N score matrix out to slow GPU HBM and reads it back, so attention is IO-bound long before it is compute-bound. FlashAttention tiles the computation and keeps intermediate scores in fast on-chip SRAM, streaming through the keys and values block by block while tracking a running softmax maximum and normalizer, so the matrix is never materialized in HBM. The output matches standard attention up to floating-point rounding.
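The streaming idea can be illustrated in plain Python. This is a minimal sketch of the running-softmax recurrence only, not the actual CUDA kernel: the function names `naive_attention` and `tiled_attention`, the `block` parameter, and the toy inputs are all assumptions made for illustration, and a real kernel also tiles the queries and keeps each block in on-chip memory.

```python
import math

def naive_attention(q, k, v):
    """Reference: build every full score row, softmax it, then weight V."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        denom = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / denom
                    for c in range(len(v[0]))])
    return out

def tiled_attention(q, k, v, block=2):
    """Stream over K/V in blocks; keep only a running max, sum, and output per query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    d_v = len(v[0])
    out = []
    for qi in q:
        m = -math.inf      # running maximum of the scores seen so far
        l = 0.0            # running sum of exp(score - m)
        acc = [0.0] * d_v  # running unnormalized output row
        for start in range(0, len(k), block):
            k_blk = k[start:start + block]
            v_blk = v[start:start + block]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            m_new = max(m, max(scores))
            # Rescale what was accumulated under the old maximum.
            correction = math.exp(m - m_new)
            l *= correction
            acc = [a * correction for a in acc]
            for s, vj in zip(scores, v_blk):
                p = math.exp(s - m_new)
                l += p
                for c in range(d_v):
                    acc[c] += p * vj[c]
            m = m_new
        out.append([a / l for a in acc])
    return out

# Toy inputs (hypothetical): 3 queries, 5 keys/values, head dimension 2.
q = [[0.1, 0.2], [0.3, -0.4], [1.0, 0.5]]
k = [[0.5, -0.1], [0.2, 0.7], [-0.3, 0.4], [0.9, 0.9], [0.0, -1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 3.0]]

ref = naive_attention(q, k, v)
tiled = tiled_attention(q, k, v, block=2)
max_diff = max(abs(a - b) for r, t in zip(ref, tiled) for a, b in zip(r, t))
print("max abs difference:", max_diff)
```

The tiled version never holds more than `block` scores per query at a time, yet it should agree with the reference to within floating-point rounding, because rescaling by `exp(m - m_new)` keeps the partial sums consistent with whatever the final maximum turns out to be.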

What Sets It Apart

Exact, not approximate. Unlike sparse or low-rank attention, it computes the same function as naive attention, with outputs that match up to floating-point rounding (not bit-for-bit, since the summation order differs). It drops into an existing model with no accuracy trade-off: no retraining, no quality regression.
Linear memory in sequence length. Removing the quadratic activation term is what made training at 32k+ context practical on ordinary hardware, rather than a research stunt.
Each version chases the silicon. v2 rebalanced work across GPU warps for higher occupancy; v3 exploits Hopper (H100) async copies and FP8; Triton and ROCm backends extend reach beyond hand-written CUDA and NVIDIA.
A dedicated decode path. KV-cache and paged variants target the memory-bound single-token generation of inference, not just the compute-heavy training case.
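To put rough numbers on the quadratic-versus-linear point above, here is a back-of-the-envelope sketch. It assumes fp16 scores (2 bytes each) and, for the linear case, one fp32 softmax statistic per query row; both choices and the helper names are illustrative assumptions, not a description of any specific kernel's buffers.

```python
def score_matrix_bytes(seq_len, bytes_per_element=2):
    """Bytes to store one full seq_len x seq_len score matrix (one head, one sequence)."""
    return seq_len * seq_len * bytes_per_element

def softmax_stats_bytes(seq_len, bytes_per_element=4):
    """Bytes to store one per-row softmax statistic for each query position."""
    return seq_len * bytes_per_element

for n in (2048, 32768):
    full = score_matrix_bytes(n)
    stats = softmax_stats_bytes(n)
    print(f"N={n}: full matrix {full / 2**20:.0f} MiB, per-row stats {stats / 2**10:.0f} KiB")
```

Under these assumptions a single head at N = 32768 would need 2 GiB for the score matrix alone, before multiplying by heads, layers, and batch size, while the per-row statistics stay at 128 KiB.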

Great Fit If / Look Elsewhere

Great fit if you train or serve transformers on modern NVIDIA GPUs and want speed and memory wins with no model changes. It is already integrated into PyTorch (as a backend of scaled_dot_product_attention), vLLM, and most training stacks, so you may be using it without knowing. Look elsewhere if you are on pre-Ampere cards (support is partial or absent), need CPU inference, or run short sequences where the score matrix already fits comfortably in cache; there the payoff shrinks toward noise.


Information

Website: github.com
Organizations: Dao AI Lab
Authors: Dao-AILab, Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
Published date: 2022/05/19
