AI Deploy · 2025

NVIDIA Dynamo

Splits LLM inference into separate prefill and decode GPU pools, then routes requests with KV-cache awareness to cut redundant recomputation. NVIDIA reports up to 30x higher throughput on DeepSeek-R1 running on GB200 NVL72. Works with TensorRT-LLM, vLLM, and SGLang.

Introduction

Serving a reasoning model is two jobs pretending to be one. Prefill is compute-bound and decode is memory-bandwidth-bound, so running both on the same GPU wastes whichever resource the current phase isn't using. Dynamo's bet is that you should physically separate them, with distinct GPU pools for prefill and decode that are each free to pick their own parallelism strategy, and let a smart layer stitch the request back together across the network.
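
To make the compute-bound versus memory-bound split concrete, here is a rough roofline-style sketch. It is not taken from Dynamo. The hardware figures (PEAK_FLOPS, MEM_BW) are hypothetical round numbers, and the model is deliberately crude: it counts about 2 FLOPs per parameter per token and assumes 2-byte weights are read once per forward pass, ignoring activations and KV-cache traffic.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int = 2) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumption: ~2 FLOPs per parameter per token, weights read once per pass.
    """
    return (2 * tokens_per_pass) / bytes_per_param


def bound_by(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a pass as 'compute' or 'memory' bound against the ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs/byte needed to saturate compute
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute"
    return "memory"


# Hypothetical accelerator: 1 PFLOP/s of compute, 3 TB/s of memory bandwidth.
PEAK_FLOPS = 1.0e15
MEM_BW = 3.0e12

# Prefill processes the whole prompt in one pass (here 2048 tokens).
prefill = bound_by(2048, PEAK_FLOPS, MEM_BW)
# Decode processes one new token per sequence (here a batch of 8 sequences).
decode = bound_by(8, PEAK_FLOPS, MEM_BW)

print(prefill, decode)
```

Under these assumptions the ridge point is about 333 FLOPs per byte, so a 2048-token prefill pass sits well above it while a small decode batch sits far below it. That gap is the reason a single shared configuration tends to leave one resource idle.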

Key Capabilities

Disaggregated prefill/decode: independent GPU pools per phase let each be tuned and scaled separately instead of sharing one compromise config; this is the source of the reported gains of more than 2x for Llama 70B on Hopper and up to 30x for DeepSeek-R1 on GB200 NVL72.
KV-aware Smart Router: tracks KV cache across the fleet via a radix tree and routes requests to where the cache already lives, so prefixes aren't recomputed.
Distributed KV Cache Manager: spills cold KV blocks to CPU memory, SSD, or networked storage and keeps hot data on GPU, reclaiming HBM for active work.
NIXL transfer library + SLO Planner: low-latency point-to-point GPU data movement plus a planner that switches between disaggregated and traditional serving based on live SLO metrics.
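
The KV-aware routing idea above can be sketched in a few lines. This is an illustration only, not Dynamo's router or its API: PrefixIndex, route, the BLOCK size, and the worker names are all hypothetical, and a plain block-level trie stands in for the radix tree. The policy shown (most cached prefix blocks first, then lowest load) is an assumed simplification.

```python
from dataclasses import dataclass, field

BLOCK = 4  # tokens per KV block (hypothetical; chosen small for readability)


@dataclass
class Node:
    children: dict = field(default_factory=dict)  # block tuple -> Node
    workers: set = field(default_factory=set)     # workers caching this prefix


class PrefixIndex:
    """Block-level trie mapping token prefixes to the workers that cache them."""

    def __init__(self):
        self.root = Node()

    @staticmethod
    def _blocks(tokens):
        # Only full blocks are treated as cacheable in this sketch.
        n = len(tokens) // BLOCK
        return [tuple(tokens[i * BLOCK:(i + 1) * BLOCK]) for i in range(n)]

    def record(self, worker, tokens):
        """Note that `worker` now holds KV blocks for this token sequence."""
        node = self.root
        for blk in self._blocks(tokens):
            node = node.children.setdefault(blk, Node())
            node.workers.add(worker)

    def overlap(self, tokens):
        """Return {worker: number of leading blocks it already has cached}."""
        depth = {}
        node = self.root
        for i, blk in enumerate(self._blocks(tokens), start=1):
            node = node.children.get(blk)
            if node is None:
                break
            for w in node.workers:
                depth[w] = i
        return depth


def route(index, load, tokens):
    """Pick the worker with the most cached prefix blocks; break ties by load."""
    hits = index.overlap(tokens)
    return min(load, key=lambda w: (-hits.get(w, 0), load[w], w))


index = PrefixIndex()
system_prompt = list(range(8))                      # two shared blocks
index.record("gpu-a", system_prompt + [100, 101, 102, 103])
index.record("gpu-b", system_prompt)
load = {"gpu-a": 5, "gpu-b": 1, "gpu-c": 0}         # in-flight requests per worker

print(route(index, load, system_prompt + [100, 101, 102, 103, 7, 7, 7, 7]))
print(route(index, load, system_prompt + [9, 9, 9, 9]))
print(route(index, load, [50, 51, 52, 53]))
```

The first request goes to the worker holding the longest matching prefix, the second falls back to load when two workers match equally, and the third has no cached prefix so it goes to the least-loaded worker. A production router would more likely weigh cache overlap against load rather than rank them strictly.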

Great Fit / Look Elsewhere

Great fit if you run reasoning or MoE models at multi-node scale and your bottleneck is GPU utilization under bursty traffic — the engine-agnostic design (TensorRT-LLM, vLLM, SGLang, PyTorch) means you keep your existing backend. Look elsewhere if you serve a single small model on one or two GPUs: disaggregation and a distributed router add operational overhead that only pays off across a fleet.

Information

Website: developer.nvidia.com
Organizations: NVIDIA
Authors: Amr Elmeleegy, Harry Kim, David Zier, Kyle Kranen, Neelay Shah, Ryan Olson, Omri Kahalon
Published date: 2025/03/18

More Items

Reinforcement Learning Papers · 2026

LongStraw: Long-Context RL Beyond 2M Tokens under a Fixed GPU Budget

Changhai Zhou, Kieran Liu +18

Enables RL post-training with million-token prompts under a fixed GPU budget by evaluating shared prompt state without autograd, retaining only minimal model state, and replaying short response branches; instantiated as GRPO and demonstrated on Qwen3.6-27B and GLM-5.2 up to multi-million token execution.

RL llm qwen mLOps ai-train+1

AI Infra · 2026

OpenTelemetry GenAI Semantic Conventions

OpenTelemetry

Defines OpenTelemetry semantic conventions for generative AI telemetry — spans, metrics, and events for GenAI clients, the Model Context Protocol (MCP), and provider-specific integrations. Includes YAML models, human-readable docs, and reference implementations to standardize observability across GenAI deployments.

mcp mcp-client mcp-server mlops ai-api+3

AI Infra · 2024

TheRock

ROCm (AMD)

Provides a lightweight build platform for HIP and ROCm that supports building ROCm, PyTorch, and JAX from source, multi-architecture nightly releases, and integrated CI/CD and developer tooling for Linux and Windows.

pytorch github ai-framework ai-development docker+1