AI Train (2023)

OpenRLHF

Trains LLMs with RLHF at scale by splitting actor, critic, reward, and reference models across separate GPU groups via Ray, with vLLM-accelerated generation and DeepSpeed ZeRO-3. Supports PPO, GRPO, REINFORCE++, DPO, plus async and agentic multi-turn RL.

Introduction

Many open-source RLHF stacks struggle to scale past a few billion parameters because the actor, critic, reward, and reference models compete for memory on the same GPUs. OpenRLHF's bet is on scheduling, not just kernels: it uses Ray to place each model on its own GPU group, hands rollout generation to vLLM, and runs the rest of the pipeline on DeepSpeed ZeRO-3. That division of labor is why 70B+ RLHF runs become routine instead of heroic.
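The placement idea can be illustrated with a minimal sketch that gives each role its own disjoint set of GPU indices. This is plain Python rather than OpenRLHF's or Ray's API: `RoleSpec`, `plan_placement`, and the GPU counts are hypothetical names and numbers chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleSpec:
    """One RLHF role and the number of GPUs it should own (illustrative)."""
    name: str
    num_gpus: int


def plan_placement(roles, total_gpus):
    """Assign each role a disjoint, contiguous range of GPU indices."""
    needed = sum(role.num_gpus for role in roles)
    if needed > total_gpus:
        raise ValueError(f"need {needed} GPUs but only {total_gpus} are available")

    plan = {}
    next_gpu = 0
    for role in roles:
        # Each role gets its own slice, so no two roles share a device.
        plan[role.name] = list(range(next_gpu, next_gpu + role.num_gpus))
        next_gpu += role.num_gpus
    return plan


# Hypothetical sizing: a large actor, smaller reward and reference models.
roles = [
    RoleSpec("actor", 8),
    RoleSpec("critic", 4),
    RoleSpec("reward", 2),
    RoleSpec("reference", 2),
]
plan = plan_placement(roles, total_gpus=16)
for name, gpus in plan.items():
    print(f"{name:>9}: GPUs {gpus[0]}-{gpus[-1]}")
```

In a real deployment a scheduler such as Ray would own this mapping and place the workers on the cluster; the sketch only shows that roles are sized independently and never share a device.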

What Sets It Apart

Disaggregated placement means you can right-size GPUs per role (a big actor, a small reward model) instead of co-locating everything and wasting memory.
vLLM-driven sampling removes the usual RLHF bottleneck, where slow generation, not gradient steps, dominates wall-clock time.
One algorithm-agnostic loop covers PPO, GRPO, RLOO, REINFORCE++, REINFORCE++-baseline, and DAPO, alongside SFT, reward-model training, and DPO/IPO, so switching methods is a flag, not a rewrite.
Async and agentic multi-turn modes via a token-in-token-out interface let it train reasoning and tool-use policies, not just single-turn preference tuning.
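To make the "flag, not a rewrite" point concrete, here is a minimal sketch of two critic-free advantage estimators selected by a string key. It is not OpenRLHF's code or command-line interface: the function names and the `method` key are hypothetical, and the GRPO variant assumes a population standard deviation with a small epsilon for stability.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: standardize rewards within one prompt's samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Leave-one-out advantage: baseline is the mean reward of the other samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Hypothetical dispatch table: the rest of the training loop stays unchanged.
ESTIMATORS = {"grpo": grpo_advantages, "rloo": rloo_advantages}


def compute_advantages(rewards, method):
    """Pick an estimator by name and apply it to one group of rewards."""
    return ESTIMATORS[method](rewards)


# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0 (incorrect).
group_rewards = [1.0, 0.0, 0.0, 1.0]
for method in ("grpo", "rloo"):
    print(method, compute_advantages(group_rewards, method))
```

Both estimators work from several sampled responses to the same prompt instead of a learned critic, which is why they fit the same rollout-then-update loop.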

Who It's For

A great fit if you run serious RLHF or RL-for-reasoning experiments on multi-node clusters and want vLLM throughput without gluing the pieces together yourself. Look elsewhere if you have a single GPU or just need quick DPO/SFT on small models: the Ray-plus-vLLM-plus-DeepSpeed setup carries real operational overhead that only pays off at scale.

Information

Website: github.com
Organizations: OpenRLHF Team, ByteDance, Tencent, Netease Fuxi AI Lab, Alibaba Group
Authors: Jian Hu, Xibin Wu, Zilin Zhu, Xianyu, Weixun Wang, Dehao Zhang, Yu Cao
Published date: 2023/07/30

More Items

Reinforcement Learning Papers (2026)

LongStraw: Long-Context RL Beyond 2M Tokens under a Fixed GPU Budget

Changhai Zhou, Kieran Liu +18

Enables RL post-training with million-token prompts under a fixed GPU budget by evaluating shared prompt state without autograd, retaining only minimal model state, and replaying short response branches; instantiated as GRPO and demonstrated on Qwen3.6-27B and GLM-5.2 up to multi-million token execution.

RL llm qwen mlops ai-train+1

AI Infra (2026)

OpenTelemetry GenAI Semantic Conventions

OpenTelemetry

Defines OpenTelemetry semantic conventions for generative AI telemetry — spans, metrics, and events for GenAI clients, the Model Context Protocol (MCP), and provider-specific integrations. Includes YAML models, human-readable docs, and reference implementations to standardize observability across GenAI deployments.

mcp mcp-client mcp-server mlops ai-api+3

AI Infra (2024)

TheRock

ROCm (AMD)

Provides a lightweight build platform for HIP and ROCm that supports building ROCm, PyTorch, and JAX from source, multi-architecture nightly releases, and integrated CI/CD and developer tooling for Linux and Windows.

pytorch github ai-framework ai-development docker+1