NVIDIA TensorRT is a production-grade ecosystem of compilers, runtimes, and model-optimization tools designed to extract maximum inference performance from NVIDIA GPUs. It takes trained models from popular frameworks such as PyTorch and TensorFlow, typically imported via ONNX, and compiles them into highly tuned binary engines, applying layer and tensor fusion, kernel auto-tuning, and reduced-precision execution (FP8, FP4, INT8, INT4) with calibration where required, to deliver low-latency, high-throughput inference across data-center, edge, and embedded platforms.
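The typical workflow is to export a trained model to ONNX and compile it offline with the TensorRT builder. The following is a minimal sketch using the TensorRT Python API; the file names are placeholders, and the available precision flags depend on the TensorRT version and the target GPU.

```python
# Minimal sketch: compiling an ONNX model into a TensorRT engine.
# "model.onnx" and "model.plan" are illustrative paths.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)  # explicit-batch network (default in TensorRT 10)
parser = trt.OnnxParser(network, logger)

# Parse the trained model exported to ONNX.
if not parser.parse_from_file("model.onnx"):
    for i in range(parser.num_errors):
        print(parser.get_error(i))
    raise RuntimeError("failed to parse ONNX model")

# Configure the build: allow FP16 kernels where they are faster.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Layer fusion and kernel auto-tuning happen during the build; the result
# is a serialized engine that the TensorRT runtime can load for inference.
engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```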
The TensorRT family now includes TensorRT-LLM for large-language-model acceleration, Model Optimizer for quantization, pruning, and related compression workflows, TensorRT for RTX to target consumer RTX GPUs on Windows, and cloud-hosted compilation services that generate optimized engines on demand. TensorRT also integrates with Triton Inference Server, PyTorch (via Torch-TensorRT), and ONNX Runtime (via the TensorRT Execution Provider), providing a single path from model prototyping to scalable deployment.
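As one example of these integrations, ONNX Runtime can delegate supported subgraphs to TensorRT through its TensorRT Execution Provider and fall back to CUDA or CPU execution elsewhere. The sketch below assumes a generic image model; the model path, input name, and shape are placeholders.

```python
# Minimal sketch: running an ONNX model through ONNX Runtime with the
# TensorRT Execution Provider. "model.onnx" and the input name/shape
# are placeholders for illustration.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=[
        "TensorrtExecutionProvider",  # TensorRT-accelerated subgraphs
        "CUDAExecutionProvider",      # fallback for unsupported layers
        "CPUExecutionProvider",
    ],
)

dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {"input": dummy_input})
print(outputs[0].shape)
```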
Originally introduced as the GPU Inference Engine (GIE), the technology was rebranded TensorRT on September 13, 2016, marking its first public release to NVIDIA Developer Program members. It has since progressed through more than ten major versions, culminating in TensorRT 10.x and the open-sourcing of key parsers and plugins.