AIAny - LMDeploy

Introduction

Throughput claims in inference frameworks are usually apples-to-oranges, so the more interesting design choice here is splitting the engine in two. LMDeploy ships TurboMind, a hand-optimized CUDA path for maximum tokens/sec, alongside a pure-PyTorch engine for models the C++ path hasn't caught up to yet — you pick the trade-off instead of being locked into one runtime.

What Sets It Apart

Dual-engine architecture: TurboMind for production speed, the PyTorch engine for fast model coverage and easier hacking, behind one API.
Throughput comes from systems work, not just bigger batches — persistent (continuous) batching, blocked KV cache, dynamic split-and-fuse, and tuned kernels combine for a claimed ~1.8x over vLLM in request throughput.
Quantization is first-class: weight-only 4-bit AWQ (4-bit inference reported ~2.4x faster than FP16, with quality confirmed via OpenCompass evaluation) plus online KV-cache quantization, so memory savings don't require a separate toolchain.
A built-in request distribution service spreads load across multiple machines and GPUs without bolting on an external router.

Who It's For

Great fit if you serve InternLM-family or other mainstream open LLMs/VLMs and want an OpenAI-compatible endpoint with aggressive quantization and high concurrency on NVIDIA hardware. Look elsewhere if you need broad non-CUDA backend support, the absolute newest model architectures the day they drop, or you'd rather not weigh TurboMind-vs-PyTorch trade-offs — a single-runtime server may be simpler.

LMDeploy

Introduction

What Sets It Apart

Who It's For

Information

Categories

Tags

More Items

Triton Inference Server

codex-lb

Y2A-Auto