AI Infra · 2024

AIBrix

Cloud-native control plane that scales vLLM on Kubernetes, adding the routing, autoscaling, and fault tolerance single-instance serving lacks. Brings high-density LoRA management, an LLM gateway, distributed KV cache reuse, and SLO-aware GPU serving.

Introduction

Running one vLLM instance is easy; running a fleet of them in production is where teams hit a wall. AIBrix exists because the hard problems of LLM serving aren't in the engine — they're in the control plane around it: how requests get routed to the replica that already has the right LoRA or KV cache warm, how you autoscale on tokens-per-second instead of CPU, and what happens when a GPU silently degrades mid-request.
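To make "autoscale on tokens-per-second instead of CPU" concrete, here is a minimal sketch of a throughput-based replica calculation. It is not AIBrix's autoscaler: the function name, the per-replica target, and the replica bounds are all hypothetical, and the formula is simply the familiar "observed load divided by per-replica capacity, rounded up" rule applied to a token-throughput signal.

```python
import math


def desired_replicas(
    observed_tokens_per_sec: float,
    target_tokens_per_sec_per_replica: float,
    min_replicas: int = 1,
    max_replicas: int = 8,
) -> int:
    """Return a replica count sized to token throughput (hypothetical helper).

    All parameters are illustrative assumptions, not AIBrix settings.
    """
    if target_tokens_per_sec_per_replica <= 0:
        raise ValueError("per-replica target must be positive")
    # Replicas needed so that each one stays at or below its throughput target.
    needed = math.ceil(observed_tokens_per_sec / target_tokens_per_sec_per_replica)
    # Clamp to the configured bounds so the fleet never scales to zero or runs away.
    return max(min_replicas, min(max_replicas, needed))
```

With a hypothetical target of 1,000 tokens per second per replica, an observed 4,500 tokens per second asks for 5 replicas, while an idle fleet stays at the configured minimum.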

What Sets It Apart

LoRA-aware and KV-aware routing: requests go to replicas that already hold the relevant adapter or cached prefix, instead of round-robin — the difference between a warm hit and recomputation.
Autoscaling tuned for LLM economics: scales on inference-specific signals rather than CPU, with the project claiming roughly 4.7x cost savings in low-traffic windows and large P99 latency cuts under load.
Distributed KV cache shared across engines, so prefixes computed by one replica can be reused by others rather than recomputed per pod.
GPU failure detection plus heterogeneous serving with SLO targets, letting mixed hardware back the same deployment.
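The routing idea in the first point can be sketched in a few lines. This is an illustration under stated assumptions, not AIBrix's gateway logic: the `Replica` class, its fields, and `pick_replica` are hypothetical names, and cached prefixes are modeled as plain strings, whereas a real engine would track hashed token blocks. The scoring prefers a replica that already holds the adapter, then the longest cached prefix, then the lightest load.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Replica:
    """Hypothetical view of one serving replica as seen by a router."""
    name: str
    loaded_adapters: set[str] = field(default_factory=set)
    cached_prefixes: set[str] = field(default_factory=set)
    in_flight: int = 0


def longest_cached_prefix(prompt: str, replica: Replica) -> int:
    """Length of the longest cached prefix that the prompt starts with."""
    return max(
        (len(p) for p in replica.cached_prefixes if prompt.startswith(p)),
        default=0,
    )


def pick_replica(
    replicas: list[Replica], prompt: str, adapter: Optional[str] = None
) -> Replica:
    """Pick the warmest replica: adapter hit, then prefix hit, then low load."""
    def score(r: Replica) -> tuple[bool, int, int]:
        has_adapter = adapter is None or adapter in r.loaded_adapters
        return (has_adapter, longest_cached_prefix(prompt, r), -r.in_flight)

    return max(replicas, key=score)
```

Round-robin would ignore all three signals; here a busier replica can still win if it saves an adapter load or a prefix recomputation, which is the trade-off the list above describes.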

Who It's For

Great fit if you already run vLLM and are scaling past a single node — platform teams who need Kubernetes-native routing, autoscaling, and multi-LoRA density without building it themselves. Look elsewhere if you serve one model at modest traffic, where plain vLLM behind a load balancer is simpler, or if you aren't on Kubernetes — AIBrix assumes that substrate.

Information

Website: github.com
Organizations: ByteDance
Authors: vllm-project
Published date: 2024/06/10

More Items

AI Infra · 2025

Apache Ossie

Apache Software Foundation

Defines a vendor-neutral JSON/YAML semantic model specification and tooling to exchange metrics, dimensions, lineage and other business semantics across analytics, AI and BI platforms; includes a core spec, validators, converters (dbt, GoodData, Salesforce) and example models.

json ai ai-development ai-tools github+2

AI Train · 2025

PRIME-RL

Prime Intellect

An asynchronous, high-throughput framework for large-scale reinforcement learning and agentic training that scales to 1T+ MoE models and 1000+ GPUs, with native verifiers integration, end-to-end SFT/RL/evals, and Slurm/Kubernetes deployment; requires NVIDIA GPUs.

RL agent-skills mLOps ai-train pytorch+3

MCP Server · 2025

Vexa

Vexa-ai

Runs a self-hosted meeting bot and transcription API that joins Google Meet, Teams and Zoom and streams speaker-attributed transcripts in real time. Compiles meetings into a git-backed Markdown workspace and runs sandboxed agents on your infrastructure; Apache-2.0 and air-gap capable.

stt mcp-server ai-agent ai-api chatbot+8