
LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Uses search-agent reading traces and tiered distractors to train LLMs for long-context, multi-hop reasoning, and introduces a rubric reward that supervises entity-level steps and is applied only to responses with a correct final answer. Improves evidence-grounded reasoning and resists reward hacking across 4B–30B models.

Introduction

Long-context reasoning often fails because models can't reliably find and integrate sparse, multi-hop evidence amid many plausible distractors. LongTraceRL reframes the training signal: instead of relying on sparse outcome-only rewards, it builds challenging contexts from search-agent trajectories and applies a fine-grained rubric that rewards correct entity-level steps. The rubric is applied only to responses with correct final answers, which encourages thorough, evidence-grounded chains without creating easy reward hacks.

Key Findings
  • Tiered distractors: training contexts combine documents the agent read but did not cite (high confusability) with pages that appeared only in search results (low confusability). So what: models learn to tell highly plausible but irrelevant passages apart from real evidence, improving retrieval and citation quality in long contexts.
  • Rubric reward: uses gold entities along each reasoning chain as process supervision. So what: provides dense feedback on intermediate steps, letting RL distinguish stronger reasoning among correct-answer trajectories.
  • Positive-only strategy: rubric reward is applied only when the final answer is correct. So what: prevents models from exploiting partial matches or spurious signals and preserves reward integrity.
  • Empirical robustness: across 4B–30B reasoning LLMs and five long-context benchmarks, the approach consistently outperforms strong baselines. So what: the gains hold across model sizes and benchmark types.
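
The rubric reward and positive-only strategy described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Rollout` type, the substring-based entity matching, the exact-match answer check, and the `rubric_weight` value are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    response: str      # full reasoning text produced by the model
    final_answer: str  # final answer extracted from the response


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons are lenient."""
    return " ".join(text.lower().split())


def rubric_score(response: str, gold_entities: list[str]) -> float:
    """Fraction of gold chain entities mentioned in the response.

    Substring matching is an assumption; a real rubric may use a
    stricter or model-based matcher.
    """
    if not gold_entities:
        return 0.0
    body = normalize(response)
    hits = sum(1 for entity in gold_entities if normalize(entity) in body)
    return hits / len(gold_entities)


def total_reward(
    rollout: Rollout,
    gold_answer: str,
    gold_entities: list[str],
    rubric_weight: float = 0.5,  # hypothetical weighting
) -> float:
    """Outcome reward plus rubric bonus, gated on a correct final answer."""
    correct = normalize(rollout.final_answer) == normalize(gold_answer)
    if not correct:
        # Positive-only: wrong answers earn nothing, even if they
        # mention every gold entity along the way.
        return 0.0
    return 1.0 + rubric_weight * rubric_score(rollout.response, gold_entities)
```

With this gating, two rollouts that both reach the right answer receive different rewards depending on how many chain entities they ground, while a wrong answer that merely mentions gold entities receives nothing.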

Who it's for + Trade-offs

Great fit if you are researching or engineering LLMs that must perform multi-hop, evidence-grounded reasoning over long documents and you can instrument or simulate search-agent traces and generate gold reasoning chains. Look elsewhere if your workload is limited to short-context Q&A, if you cannot provide ground-truth intermediate entities, or if compute constraints prevent RL-style fine-tuning. Constructing tiered distractors and running RL with verifiable rewards (RLVR) adds data-prep and training overhead.

Methodology (brief)

The paper synthesizes multi-hop questions via knowledge-graph random walks, replays search-agent trajectories to label which documents were read versus merely surfaced, and forms tiered distractors to boost confusability. The rubric reward scores entity-level correctness along the reasoning chain and is only granted for trajectories that yield a correct final answer. Combined with RL with verifiable rewards (RLVR), this yields denser, more targeted supervision for intermediate reasoning steps without weakening final-answer fidelity.
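
The tiering step described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's pipeline: the `Doc` type, the `build_context` helper, the "fill high-confusability first" policy, and the fixed document budget are all hypothetical choices made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str


def build_context(
    cited: list[Doc],     # evidence the agent cited
    read: list[Doc],      # every document the agent opened
    surfaced: list[Doc],  # every document shown in search results
    max_docs: int,
    seed: int = 0,
) -> list[Doc]:
    """Mix evidence with tiered distractors and shuffle the result."""
    cited_ids = {d.doc_id for d in cited}
    read_ids = {d.doc_id for d in read}

    # High confusability: opened by the agent but never cited.
    high = [d for d in read if d.doc_id not in cited_ids]
    # Low confusability: only appeared in search results.
    low = [
        d for d in surfaced
        if d.doc_id not in cited_ids and d.doc_id not in read_ids
    ]

    # Assumed policy: spend the remaining budget on high-tier
    # distractors first, then fall back to low-tier ones.
    budget = max(0, max_docs - len(cited))
    distractors = (high + low)[:budget]

    context = list(cited) + distractors
    # Shuffle so evidence position carries no signal.
    random.Random(seed).shuffle(context)
    return context
```

Keeping the two tiers separate until the final merge makes it easy to vary the high-to-low ratio when tuning how hard the training contexts are.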

Information

  • Website: arxiv.org
  • Authors: Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li
  • Published date: 2026/05/29