Overview
TensorRT-LLM accelerates large-language-model inference by compiling models into TensorRT engines that combine custom attention kernels, paged KV caching, quantization (FP8, FP4, INT4, INT8), and speculative decoding.
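To make the workflow concrete, here is a minimal sketch using the high-level Python LLM API shipped in recent releases; the model name, prompt, and sampling values are placeholders, not recommendations.

```python
# Minimal end-to-end sketch of the Python LLM API (assumed available in
# recent TensorRT-LLM releases). The checkpoint name is a placeholder.
from tensorrt_llm import LLM, SamplingParams

def main():
    # Builds (or loads a cached) TensorRT engine for the checkpoint, then
    # runs generation through the underlying C++ runtime.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
    for output in llm.generate(
            ["Explain paged KV caching in one sentence."], params):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```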
Key Capabilities
- Automatic engine generation from PyTorch checkpoints (see the build sketch after this list)
- In-flight batching and lookahead speculative decoding for high throughput (lookahead sketch below)
- Multi-GPU / multi-node execution via tensor and pipeline parallelism, with a Triton Inference Server backend for production serving (parallelism sketch below)
- Python and C++ runtimes, plus an OpenAI-compatible serving endpoint (client sketch below)
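Engine generation and quantization can be driven from the same Python API. The sketch below assumes the `BuildConfig`, `QuantConfig`, and `QuantAlgo` names from recent releases; verify them against your installed version.

```python
# Hedged sketch: configuring the engine build and FP8 quantization through
# the LLM API. All class and field names are assumptions from recent
# TensorRT-LLM releases; the checkpoint name is a placeholder.
from tensorrt_llm import LLM, BuildConfig
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

build_config = BuildConfig(
    max_batch_size=8,     # upper bound for in-flight batching
    max_input_len=2048,   # longest prompt the engine will accept
    max_seq_len=4096,     # prompt plus generated tokens
)
quant_config = QuantConfig(quant_algo=QuantAlgo.FP8)  # FP8 needs Hopper-class GPUs

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    build_config=build_config,
    quant_config=quant_config,
)
llm.save("./llama3-8b-fp8-engine")  # persist the built engine for reuse
```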
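Lookahead decoding is one of the speculative methods exposed through the same API. The following is a sketch only: `LookaheadDecodingConfig`, its fields, and the `speculative_config` argument are assumptions drawn from recent release examples.

```python
# Hedged sketch of enabling lookahead speculative decoding. Parameter names
# and values are assumptions; tune the window, n-gram, and verification-set
# sizes for your model and workload.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import LookaheadDecodingConfig

lookahead = LookaheadDecodingConfig(
    max_window_size=4,            # tokens guessed per decoding step
    max_ngram_size=3,             # n-gram pool used to propose guesses
    max_verification_set_size=4,  # candidate branches verified per step
)
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder checkpoint
    speculative_config=lookahead,
)
outputs = llm.generate(["Summarize in-flight batching."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```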
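For multi-GPU execution, the LLM API shards a model with a single argument; a minimal sketch, assuming two local GPUs (multi-node runs additionally rely on MPI, which is outside this sketch):

```python
# Hedged sketch: tensor parallelism across two GPUs via the LLM API.
# A pipeline_parallel_size argument exists as well for pipeline sharding.
from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    tensor_parallel_size=2,  # shard weights and attention heads over 2 GPUs
)
```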
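Because the serving endpoint is OpenAI-compatible, any standard OpenAI client can talk to it. The sketch below assumes a server is already running locally (recent releases ship a `trtllm-serve` command for this); the host, port, and model name are placeholders.

```python
# Hedged sketch: querying a running TensorRT-LLM server through the
# standard OpenAI Python SDK. Endpoint details are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model name
    messages=[{"role": "user", "content": "What is in-flight batching?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```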