AIAny - Reinforcement Learning Papers

Playing Atari with Deep Reinforcement Learning

First model to learn control policies directly from raw Atari pixels, pairing a convolutional network with Q-learning and experience replay. A single unchanged architecture played seven games, beating prior methods on six and a human expert on three.

RL deepmind paper
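The pairing of Q-learning with experience replay described in this entry can be illustrated with a minimal sketch. It is not the paper's implementation: the `ReplayBuffer` class, the `q_target` helper, and the default `gamma` are hypothetical names and values chosen for illustration, and the convolutional network is replaced by a plain list of next-state Q-values.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions.

    Hypothetical helper for illustration only.
    """

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling decorrelates consecutive frames in a minibatch.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def q_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r if terminal, else r + gamma * max Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```

In a training loop, each sampled transition would yield a target from `q_target`, and the network would be regressed toward it; the regression step is omitted here.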

Reinforcement Learning Papers, 2016

Mastering the game of Go with deep neural networks and tree search

David Silver, Aja Huang +18 (Google DeepMind)

Combines a policy network (to narrow move choices) and a value network (to score board positions) with Monte Carlo tree search, cutting Go's vast search space enough to beat other Go programs in 99.8% of games and defeat the European champion 5-0.

RL deepmind paper
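How a policy prior and a value estimate can steer tree search is easiest to see in the move-selection step. The sketch below uses a PUCT-style rule (mean value plus a prior-weighted exploration bonus that shrinks with visits). The `select_move` function, the `stats` layout, and the `c_puct` default are assumptions made for illustration, not the system's actual code.

```python
import math


def select_move(stats, c_puct=1.0):
    """Pick the move maximizing Q + U at one tree node.

    stats maps move -> {"P": prior from a policy network,
                        "N": visit count,
                        "Q": mean value of the subtree}.
    The dictionary layout is hypothetical.
    """
    total_visits = sum(s["N"] for s in stats.values())

    def score(move):
        s = stats[move]
        # Exploration bonus: large for high-prior, rarely visited moves.
        bonus = c_puct * s["P"] * math.sqrt(total_visits) / (1 + s["N"])
        return s["Q"] + bonus

    return max(stats, key=score)
```

With `c_puct=0` the rule is purely greedy on the value estimate; larger values shift visits toward moves the policy prior favors but the search has not yet explored.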

Large Language Model Papers, 2022

InstructGPT: Training Language Models to Follow Instructions with Human Feedback

Long Ouyang, Jeff Wu +4 (OpenAI)

Made reinforcement learning from human feedback (RLHF) the standard alignment recipe: collect demonstrations and preference rankings, train a reward model, then optimize with PPO. A 1.3B aligned model was preferred over the 175B GPT-3 by human raters.

openai RL paper LLM NLP
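The reward-model step in this recipe is commonly trained on preference pairs with a pairwise logistic loss, which is small when the preferred response scores higher than the rejected one. The sketch below shows that loss for a single pair of scalar scores; the function name is hypothetical and the scores stand in for reward-model outputs.

```python
import math


def pairwise_preference_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)) for one preference pair.

    Written as softplus(-diff) in a numerically stable form so that
    large score gaps do not overflow math.exp.
    """
    diff = r_chosen - r_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

Equal scores give a loss of log 2, and the loss approaches zero as the chosen response's score pulls ahead of the rejected one.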

Large Language Model Papers, 2025

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

DeepSeek-AI, Aixin Liu +262 (DeepSeek-AI)

An open large language model pairing DeepSeek Sparse Attention (DSA) for cheaper long-context inference with a scaled RL pipeline. Authors claim parity with GPT-5, with a high-compute Speciale variant surpassing it and rivaling Gemini-3.0-Pro on reasoning.

deepseek LLM paper RL ai-agent

Reinforcement Learning Papers, 2026

Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning

Jiapeng Zhu, Jianxiang Yu +6

Combines internalizing general skills with task-specific skill utilization via a difficulty-aware router to improve in-distribution and out-of-distribution performance for agentic RL. Uses privileged distillation for hard tasks and diagnostic probing for easy tasks; evaluated on ALFWorld and WebShop.

agent-skills RL ai-agent paper

Large Language Model Papers, 2026

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Nianyi Lin, Jiajie Zhang +2

Uses search-agent reading traces and tiered distractors to train LLMs for long-context, multi-hop reasoning, and introduces a rubric reward that supervises entity-level steps and is applied only when the final answer is correct. Improves evidence-grounded reasoning and resists reward hacking across 4B–30B models.

RL LLM NLP paper code +1
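Gating a rubric reward on a correct final answer can be sketched as follows. Everything here is a hypothetical illustration rather than the paper's reward: the function name, the exact-match correctness check, the substring test for rubric entities, and the `rubric_weight` value are all assumptions.

```python
def rubric_reward(final_answer, gold_answer, reasoning, rubric_entities,
                  rubric_weight=0.5):
    """Outcome reward plus a rubric bonus that is paid only on correct answers.

    Assumed scoring: 1.0 for a correct final answer, plus rubric_weight times
    the fraction of rubric entities that appear in the reasoning text.
    """
    correct = final_answer.strip().lower() == gold_answer.strip().lower()
    if not correct:
        # No rubric credit without a correct final answer.
        return 0.0
    if not rubric_entities:
        return 1.0
    text = reasoning.lower()
    hits = sum(1 for entity in rubric_entities if entity.lower() in text)
    return 1.0 + rubric_weight * hits / len(rubric_entities)
```

Under this gating, a response that merely lists rubric entities without reaching the correct answer earns nothing, which is one way such a reward could discourage reward hacking.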

Reinforcement Learning Papers, 2026

A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL

Lei Yang, Siyu Ding +1

Analyzes how single-domain RL fine-tuning of LLMs induces cross-domain interference and shows that the damage concentrates in a low-dimensional shared conflict subspace. Proposes a local perturbation theory and short domain "refresh" procedures that selectively recover earlier domains with minimal collateral loss.

RL LLM paper NLP code +1

Reinforcement Learning Papers, 2026

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Pengcheng Jiang, Zhiyi Shi +6

A 20B retrieval subagent trained with reinforcement learning inside a stateful search harness that externalizes recoverable search state (candidate pool, curated evidence, verification records). The harness lets the policy focus on semantic search decisions, improving curated recall and transfer robustness.

RL ai-agent agent-skills vllm huggingface +1

Natural Language Processing Papers, 2026

Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation

Hanxu Hu, Zdeněk Šnajdr +3

Trains LLMs with reinforcement learning on a surface-level chrF reward so that models learn to extract and apply linguistic signals from rich context when translating completely unseen languages. Demonstrates better zero-shot translation than in-context learning or supervised fine-tuning, framing outcome-based RL as a meta-skill for language learning from context.

RL multilingual translation NLP LLM +1
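chrF scores a translation by character n-gram overlap with a reference, which is why it can serve as a cheap surface-level reward. The sketch below is a simplified version written for illustration: it ignores whitespace, averages precision and recall over n-gram orders 1 to 6, and combines them with an F-score weighted toward recall (`beta=2`). It is not guaranteed to match any particular library's chrF output, and how the paper scales or shapes the reward is not shown.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged character n-gram precision/recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

An identical hypothesis and reference score 1.0, strings with no shared characters score 0.0, and partial overlaps fall in between, giving the policy a graded signal even when no exact match is reached.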

Large Language Model Papers, 2026

On the Geometry of On-Policy Distillation

Zhennan Shen, Yanshu Li +7

Analyzes the parameter-space geometry of on-policy distillation (OPD) for LLM training, showing OPD updates affect fewer weights, avoid principal directions, and rapidly lock into a low-dimensional update subspace. Compares OPD with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) and studies implications for optimization and objective mixing.

paper LLM RL NLP foundation-model +2

Reinforcement Learning Papers, 2026

APPO: Agentic Procedural Policy Optimization

Xucong Wang, Ziyu Ma +6

Shifts branching and credit assignment in agentic RL from coarse units to fine-grained decision points in generated sequences. Uses a Branching Score combining token uncertainty and policy-induced likelihood gains plus procedure-level advantage scaling; improves performance across 13 benchmarks while keeping efficient tool calls.

RL ai-agent LLM paper evaluation +1

Large Language Model Papers, 2026

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Jiacheng Chen, Xinyu Zhang +21

Applies a population-level test-time scaling strategy that uses one model as generator, verifier, refiner, and ranker to search over candidate proofs. Combines generative-verifier RL and a low false-positive verifier with tournament selection to reach competition-level performance on IMO and USAMO.

paper LLM RL ai ai-rank +2
