Text-Generation-Inference

Hugging Face’s Rust + Python server for high-throughput, multi-GPU text generation.

Introduction

Overview

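TGI shards model weights across GPUs, streams tokens to clients over Server-Sent Events (SSE), and exposes /generate and /generate_stream HTTP routes plus an OpenAI-compatible /v1/chat/completions route. gRPC is used internally, between the router and the model shards.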
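As a rough illustration of the HTTP interface, the sketch below builds a request for the /generate or /generate_stream route and parses the SSE lines that come back. The base URL, the helper names, and the prompt are assumptions made for illustration. The payload and event shapes follow TGI's documented JSON format as best understood here, so check them against the server version you run.

```python
import json
import urllib.request

# Assumption: a TGI server is reachable at this address.
BASE_URL = "http://localhost:8080"


def build_generate_request(prompt, max_new_tokens=20, stream=False):
    """Build a POST request for /generate, or /generate_stream when stream=True."""
    route = "/generate_stream" if stream else "/generate"
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        BASE_URL + route,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_sse_line(raw_line):
    """Return the token text carried by one SSE line, or None for non-data lines."""
    line = raw_line.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None
    event = json.loads(line[len("data:"):])
    token = event.get("token") or {}
    return token.get("text")


def stream_tokens(prompt, max_new_tokens=20):
    """Yield token strings as the server emits them (requires a running server)."""
    request = build_generate_request(prompt, max_new_tokens, stream=True)
    with urllib.request.urlopen(request) as response:
        for raw_line in response:
            text = parse_sse_line(raw_line)
            if text is not None:
                yield text
```

Against a running server, iterating over stream_tokens("What is deep learning?") would yield the generated text piece by piece. Nothing in the block contacts the network until that generator is consumed.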

Key Capabilities

Tensor parallelism and quantization
KV cache with paged attention (kernels adapted from vLLM), speculative decoding
Prometheus metrics and OpenTelemetry tracing (viewable in backends such as Jaeger)
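Sharding and quantization are typically selected when the server is started. The sketch below assembles a text-generation-launcher command line using its --model-id, --num-shard, --quantize, and --port flags. The helper name and the model id "my-org/my-model" are hypothetical placeholders, and the set of accepted --quantize values depends on the TGI version.

```python
import shlex


def launcher_command(model_id, num_shard=1, quantize=None, port=8080):
    """Build an argv list for text-generation-launcher (hypothetical helper)."""
    argv = [
        "text-generation-launcher",
        "--model-id", model_id,
        "--num-shard", str(num_shard),  # number of GPUs to shard weights across
        "--port", str(port),
    ]
    if quantize is not None:
        argv += ["--quantize", quantize]  # e.g. "bitsandbytes"
    return argv


# "my-org/my-model" is a placeholder, not a real model id.
cmd = launcher_command("my-org/my-model", num_shard=2, quantize="bitsandbytes")
print(shlex.join(cmd))
# prints: text-generation-launcher --model-id my-org/my-model --num-shard 2 --port 8080 --quantize bitsandbytes
```

The resulting list can be passed to subprocess.run on a machine where the launcher is installed. Building it as a list rather than a single string avoids shell-quoting problems.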

Information

Website: huggingface.co
Authors: Hugging Face
Published date: 2022/11/07

More Items

Ray

2017

RISELab (UC Berkeley), Anyscale Inc.

Ray is an open-source distributed compute engine that lets you scale Python and AI workloads—from data processing to model training and serving—without deep distributed-systems expertise.

ai-development ai-framework ai-train ai-serving

OpenVINO

2018

Intel

OpenVINO is an open-source toolkit from Intel that streamlines the optimization and deployment of AI inference models across a wide range of Intel® hardware.

ai-development ai-inference ai-serving

NVIDIA Dynamo

2025

NVIDIA

NVIDIA Dynamo is an open-source, high-throughput, low-latency inference framework that scales generative-AI and reasoning models across large, multi-node GPU clusters.

ai-development ai-inference ai-serving nvidia