AI Infra, 2017

Ray (by Anyscale)

Scales any Python or ML workload across CPUs and GPUs with a few decorators, instead of rewriting code for Spark or MPI. Bundles libraries for distributed training, hyperparameter tuning, RL, batch inference, and online model serving on one cluster.


Introduction

Most teams hit the same wall: a model trains fine on one machine, then scaling it means rewriting everything around Spark, MPI, or a bespoke task queue. Ray's bet is that scaling Python should not require switching frameworks: wrap a function with @ray.remote and it runs as a distributed task, and the same primitives carry from a laptop to a thousand-node cluster.

What Sets It Apart

One runtime, full ML lifecycle: Ray Core handles distributed tasks and stateful actors, while Train, Tune, RLlib, Data, and Serve cover training, hyperparameter search, reinforcement learning, batch processing, and serving — so a pipeline stays on one cluster instead of being stitched across tools.
Born from research, hardened in production: it started at UC Berkeley's RISELab in 2017 and is now driven by Anyscale, with 1,000+ contributors and adoption at OpenAI, Uber, and Shopify for large-scale training and inference.
Heterogeneous scheduling: a single job can mix CPU and GPU tasks with fractional GPU allocation, which suits modern LLM and RL workloads that interleave the two.

Who It's For

Great fit if you have Python ML code that has outgrown one box and you want to scale without adopting a new programming model, or if you need training, tuning, and serving to share infrastructure. Look elsewhere if your data work fits squarely in Spark SQL or pandas at small scale — Ray's actor and cluster model adds operational overhead you won't recoup unless you actually need distribution.


Information

Website: www.ray.io
Organizations: Anyscale, UC Berkeley RISELab
Authors: Anyscale, RISELab (UC Berkeley)
Published date: 2017/12/16
