AI Infra2024

exo

Connects multiple Macs and Linux machines into one cluster to run models too large for any single machine. Auto-discovers peers, shards a model across them via tensor parallelism, and exposes OpenAI-, Claude-, and Ollama-compatible APIs.

Visit Website

Introduction

The bottleneck for running frontier models at home was never the model — it was the assumption that you need one machine big enough to hold it. exo discards that assumption: it shards a single model across the Apple-silicon Macs and Linux boxes you already own, so a few mid-range machines can collectively serve a model none of them could load alone.

What Sets It Apart

Zero-config topology — devices on the network discover each other automatically and exo maps the cluster's shape, so adding a machine doesn't mean rewriting config; the model just spreads further.
Tensor parallelism, not only capacity — up to 1.8x speedup on 2 devices and 3.2x on 4, so extra hardware buys you throughput, not just room to fit a bigger model.
RDMA over Thunderbolt 5 on recent Apple silicon (M4 Pro/Max, M3 Ultra) cuts the inter-device latency that usually makes distributed inference slower than the math suggests.
Drop-in API surface — OpenAI Chat Completions, Claude Messages, and Ollama formats all work, so existing clients point at your cluster with no code changes.

Who It's For

Great fit if you own several Apple-silicon Macs (or a Mac-plus-Linux mix) and want to run models that won't fit in any one of them without renting cloud GPUs. Look elsewhere if you have a single large NVIDIA box — exo's Linux path is still CPU-only and its sweet spot is Metal/MLX. The peer-to-peer design also assumes a fast, trusted local network; it is not a replacement for managed serving at scale.

Back

Information

Websitegithub.com
Authorsexo labs
Published date2024/06/24

More Items

Reinforcement Learning Papers2026

LongStraw: Long-Context RL Beyond 2M Tokens under a Fixed GPU Budget

Changhai Zhou, Kieran Liu +18

Enables RL post-training with million-token prompts under a fixed GPU budget by evaluating shared prompt state without autograd, retaining only minimal model state, and replaying short response branches; instantiated as GRPO and demonstrated on Qwen3.6-27B and GLM-5.2 up to multi-million token execution.

RL llm qwen mLOps ai-train+1

AI Infra2026

OpenTelemetry GenAI Semantic Conventions

OpenTelemetry

Defines OpenTelemetry semantic conventions for generative AI telemetry — spans, metrics, and events for GenAI clients, the Model Context Protocol (MCP), and provider-specific integrations. Includes YAML models, human-readable docs, and reference implementations to standardize observability across GenAI deployments.

mcp mcp-client mcp-server mlops ai-api+3

AI Infra2024

TheRock

ROCm (AMD)

Provides a lightweight build platform for HIP and ROCm that supports building ROCm, PyTorch, and JAX from source, multi-architecture nightly releases, and integrated CI/CD and developer tooling for Linux and Windows.

pytorch github ai-framework ai-development docker+1

exo