TensorRT-LLM

NVIDIA’s open-source library that compiles Transformer models into highly optimized TensorRT engines for low-latency, high-throughput LLM inference on NVIDIA GPUs.

Introduction

Overview

TensorRT-LLM accelerates large-language-model inference by generating TensorRT engines with custom attention kernels, paged KV caching, quantization (FP8, FP4, INT8, INT4), and speculative decoding.
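A minimal sketch of the high-level Python LLM API, assuming a recent TensorRT-LLM release; the model name and sampling settings are illustrative, and the first call compiles an engine for the local GPU:

```python
# Minimal sketch: generate text with TensorRT-LLM's Python LLM API.
# Model name and sampling settings are illustrative assumptions.
from tensorrt_llm import LLM, SamplingParams

def main():
    # On first use this downloads the checkpoint and builds a
    # TensorRT engine optimized for the local GPU.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
    outputs = llm.generate(["What is TensorRT-LLM?"], params)

    for out in outputs:
        print(out.outputs[0].text)

if __name__ == "__main__":
    main()
```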

Key Capabilities
  • Automatic engine generation from PyTorch checkpoints
  • In-flight batching and lookahead decoding for high throughput
  • Multi-GPU / multi-node deployment via the Triton backend
  • Python & C++ runtimes with an OpenAI-compatible API (see the client sketch below)
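
Because the runtime can expose an OpenAI-compatible HTTP endpoint (for example via the trtllm-serve command), a stock OpenAI client can query it. A hedged sketch, assuming a server is already running on localhost:8000 and serving the model named below:

```python
# Sketch: query a locally served TensorRT-LLM model through its
# OpenAI-compatible endpoint. Host, port, and model name are
# assumptions and must match how the server was launched.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "Explain in-flight batching in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```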

Information

Categories