SGLang is a high-performance serving framework for large language models (LLMs) and vision-language models (VLMs), designed for low-latency, high-throughput inference from a single GPU up to large distributed clusters. Key features include RadixAttention for prefix caching, a zero-overhead batch scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, and quantization (FP4/FP8/INT4/AWQ/GPTQ). It supports a wide range of models such as Llama, Qwen, and DeepSeek; runs on NVIDIA, AMD, and Intel GPUs as well as TPUs; and ships an intuitive frontend language for building LLM applications.
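To make the prefix-caching idea concrete, here is a toy sketch of a radix-style prefix tree over token IDs. This is only an illustration of the concept behind RadixAttention, not SGLang's implementation: the real system caches KV-cache tensors on the GPU and evicts them under memory pressure, while this sketch merely counts how many leading tokens of a new request could be reused. All class and method names here are made up for the example.

```python
# Toy sketch of prefix caching with a radix-style tree. This illustrates the
# idea behind RadixAttention only; SGLang's actual implementation caches GPU
# KV tensors and handles eviction. Names below are hypothetical.

class RadixNode:
    def __init__(self):
        self.children = {}  # token id -> RadixNode

class PrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        """Record a token sequence (e.g. a finished request) in the tree."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

    def match_prefix(self, tokens):
        """Return how many leading tokens are already in the cache."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4])               # e.g. tokens of a shared system prompt
print(cache.match_prefix([1, 2, 3, 9]))  # -> 3: only the last token needs prefill
```

Requests that share a prompt prefix (a common system message, few-shot examples, multi-turn history) skip recomputing attention for the matched tokens, which is where the latency and throughput gains come from.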