AI Deploy (2021)

Serving Models | TFX | TensorFlow

Deploys trained SavedModels behind gRPC and REST endpoints, with hot-swappable versioning so new weights load without downtime. Built around servables, loaders, sources, and a manager, plus request batching to cut accelerator cost.

Introduction

Most teams can train a model long before they can keep one serving reliably in production: version rollouts, zero-downtime reloads, and request batching are where deployments tend to break. TensorFlow Serving treats inference itself as infrastructure. A model is just one kind of "servable", managed by the same lifecycle machinery that handles lookup tables or model ensembles, so the loading, versioning, and traffic logic stays identical whatever you ship.
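
The servable idea can be illustrated with a toy lifecycle manager. This is a minimal sketch, not TensorFlow Serving's API: the real system is written in C++, and the `Loader` and `Manager` classes and their methods below are hypothetical names chosen only to mirror the concepts. It shows a model and a lookup table passing through the same load, lookup, and unload path, with two versions of one servable held at once.

```python
# Toy illustration only: TensorFlow Serving itself is written in C++ and
# none of the class or method names below belong to its real API.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class Loader:
    """Knows how to load one version of one servable."""
    name: str
    version: int
    load_fn: Callable[[], Any]

    def load(self) -> Any:
        return self.load_fn()


@dataclass
class Manager:
    """Keeps loaded servables and answers lookups by name and version."""
    _loaded: Dict[str, Dict[int, Any]] = field(default_factory=dict)

    def on_new_version(self, loader: Loader) -> None:
        # Load fully before registering, so lookups never see a
        # half-loaded servable.
        servable = loader.load()
        self._loaded.setdefault(loader.name, {})[loader.version] = servable

    def unload(self, name: str, version: int) -> None:
        self._loaded[name].pop(version)

    def get(self, name: str, version: Optional[int] = None) -> Any:
        # With no explicit version, fall back to the highest one loaded.
        versions = self._loaded[name]
        chosen = max(versions) if version is None else version
        return versions[chosen]


manager = Manager()
# A "model" and a lookup table go through the same lifecycle.
manager.on_new_version(Loader("doubler", 1, lambda: (lambda x: 2 * x)))
manager.on_new_version(Loader("vocab", 1, lambda: {"hello": 0}))
# Version 2 arrives while version 1 is still available.
manager.on_new_version(Loader("doubler", 2, lambda: (lambda x: 2 * x + 1)))
```

Calling `manager.get("doubler")` returns the newest loaded version, while `manager.get("doubler", 1)` pins the older one, which is the property that canary and rollback workflows rely on.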

What Sets It Apart

Decouples discovery from loading: Sources find model versions (for example, by watching a directory on disk) and Loaders standardize load/unload, so swapping in fresh weights is a file-system event, not a redeploy.
Runs multiple versions of the same model concurrently, which makes canary rollouts and A/B experiments a serving-layer concern rather than something you bolt on upstream.
Ships a request-batching widget that coalesces multiple inference calls into a single batched request, a large win on GPUs/TPUs where per-call overhead dominates, without forcing clients to manage batching themselves.
Exposes both gRPC and REST, so the same SavedModel serves low-latency internal services and HTTP clients without rewrapping.
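
As a rough sketch of how the REST side and version pinning fit together, the helper below builds a predict request in the shape used by TensorFlow Serving's REST API (`/v1/models/<name>[/versions/<n>]:predict` with an `instances` payload). The helper name `build_predict_request`, the host `localhost:8501`, and the model name `my_model` are assumptions for illustration; nothing is sent over the network here.

```python
import json
from typing import Any, List, Optional, Tuple


def build_predict_request(
    host: str,
    model_name: str,
    instances: List[Any],
    version: Optional[int] = None,
) -> Tuple[str, bytes]:
    """Return the URL and JSON body for a REST predict call.

    Omitting `version` leaves version selection to the server; passing
    one pins the request to that specific loaded version.
    """
    version_part = f"/versions/{version}" if version is not None else ""
    url = f"http://{host}/v1/models/{model_name}{version_part}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return url, body


# Hypothetical host and model name, used only for illustration.
default_url, _ = build_predict_request("localhost:8501", "my_model", [[1.0, 2.0]])
pinned_url, pinned_body = build_predict_request(
    "localhost:8501", "my_model", [[1.0, 2.0]], version=2
)
```

The returned pair can be posted with any HTTP client, for instance `urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})`. Routing a fraction of callers to the pinned URL is one simple way to run a canary against a new version, assuming both versions are loaded on the server.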

Who It's For

Great fit if you live in the TensorFlow ecosystem, ship SavedModels, and need production-grade versioning and throughput as part of a TFX pipeline. Look elsewhere if your models are PyTorch or ONNX, you want a framework-agnostic server, or your use case is a single low-traffic endpoint where the operational surface area isn't worth it.

Information

Website: www.tensorflow.org
Organizations: Google
Authors: TensorFlow
Published date: 2021/01/28
