AI Infra, 2018

ONNX Runtime

Runs ONNX models faster on CPU, GPU, and NPU by routing subgraphs of the model graph to backend execution providers (CUDA, TensorRT, OpenVINO, DirectML, CoreML). One engine serves the same model across cloud, browser, mobile, and edge, for both inference and training.

Introduction

The hard part of shipping a model was never training it. It was getting the same model to run fast on a Windows laptop, an Android phone, an NVIDIA server, and a web page without rewriting it five times. ONNX Runtime's bet is that a single intermediate format plus a pluggable backend system can absorb that fragmentation, which is why it now quietly powers inference inside Windows, Office, and Bing as well as many external products.

What Sets It Apart

Execution provider architecture: instead of one monolithic runtime, it partitions a model graph and hands each subgraph to the highest-priority execution provider that supports it (CUDA, TensorRT, OpenVINO, DirectML, CoreML, or plain CPU), falling back to a lower-priority provider when a kernel isn't supported. You write the model once and choose the accelerator at deployment time by listing providers in order of preference.
Framework-agnostic reach: it consumes models exported from PyTorch, TensorFlow/Keras, scikit-learn, LightGBM, and XGBoost, so classical ML and deep learning share the same serving path.
Both directions of the pipeline: beyond inference it accelerates transformer training and supports on-device training for personalization without sending data off the device.
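
The provider fallback described above can be sketched with the Python API. This is a minimal sketch, assuming the `onnxruntime` package is installed and that a model file exists at whatever path you pass in. `pick_providers` and `make_session` are hypothetical helper names introduced here for illustration, not part of ONNX Runtime; the only library calls used are `onnxruntime.get_available_providers()` and `onnxruntime.InferenceSession(..., providers=[...])`.

```python
from typing import Iterable, List

CPU = "CPUExecutionProvider"


def pick_providers(preferred: Iterable[str], available: Iterable[str]) -> List[str]:
    """Hypothetical helper: keep the preferred providers that this build
    actually offers, in priority order, and always end with the CPU provider
    so unsupported operators still have somewhere to run."""
    available_set = set(available)
    chosen = [p for p in preferred if p in available_set and p != CPU]
    chosen.append(CPU)
    return chosen


def make_session(model_path: str, preferred: Iterable[str]):
    """Hypothetical helper: open a session with the filtered provider list."""
    # Imported lazily so pick_providers can be used without onnxruntime installed.
    import onnxruntime as ort

    providers = pick_providers(preferred, ort.get_available_providers())
    # Providers are tried in list order; nodes a provider cannot handle
    # fall through to the next entry.
    return ort.InferenceSession(model_path, providers=providers)


# Example: prefer TensorRT, then CUDA, on a machine that only has CUDA.
print(pick_providers(
    ["TensorrtExecutionProvider", "CUDAExecutionProvider"],
    ["CUDAExecutionProvider", "CPUExecutionProvider"],
))
```

The same preference list can then be reused across machines: a server with a GPU build resolves it to a GPU provider plus CPU, while a laptop with a CPU-only build resolves it to the CPU provider alone, with no change to the calling code.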

Who It's For

Great fit if you need one deployment target that spans server, browser, mobile, and edge, or if you want hardware-specific speedups without locking your code to a vendor SDK. Look elsewhere if you live entirely inside one framework's native serving stack (e.g. TorchServe) and never move off a single hardware target: the indirection of exporting to ONNX and tuning execution providers adds friction you won't recoup.

Information

Website: onnxruntime.ai
Authors: Microsoft
Published date: 2018/12/04

More Items

Reinforcement Learning Papers, 2026

LongStraw: Long-Context RL Beyond 2M Tokens under a Fixed GPU Budget

Changhai Zhou, Kieran Liu +18

Enables RL post-training with million-token prompts under a fixed GPU budget by evaluating shared prompt state without autograd, retaining only minimal model state, and replaying short response branches; instantiated as GRPO and demonstrated on Qwen3.6-27B and GLM-5.2 up to multi-million token execution.

RL llm qwen mLOps ai-train+1

AI Infra, 2026

OpenTelemetry GenAI Semantic Conventions

OpenTelemetry

Defines OpenTelemetry semantic conventions for generative AI telemetry — spans, metrics, and events for GenAI clients, the Model Context Protocol (MCP), and provider-specific integrations. Includes YAML models, human-readable docs, and reference implementations to standardize observability across GenAI deployments.

mcp mcp-client mcp-server mlops ai-api+3

AI Infra, 2024

TheRock

ROCm (AMD)

Provides a lightweight build platform for HIP and ROCm that supports building ROCm, PyTorch, and JAX from source, multi-architecture nightly releases, and integrated CI/CD and developer tooling for Linux and Windows.

pytorch github ai-framework ai-development docker+1