Explore by tags
2023
vLLM Project, Sky Computing Lab (UC Berkeley)
High-throughput inference and serving engine for LLMs that reduces KV-cache memory waste by paging the cache in fixed-size blocks with PagedAttention. Provides continuous batching, CUDA/HIP acceleration, Hugging Face integration, quantization, and an OpenAI-compatible API for production LLM serving.
2023
Ettore Di Giacinto (mudler), Community contributors
Runs LLMs, vision, audio, and multimodal models locally behind an OpenAI-compatible API, with CPU-only operation or GPU acceleration across 35+ backends. Includes built-in agents, multi-user access controls, a model gallery, and privacy-first local inference.
