Most learning-to-search agents bundle growing transcripts and bookkeeping into the policy, forcing RL to optimize both semantic choices and recoverable routine state. Harness-1 flips that tradeoff: it externalizes working search state to the environment (candidate pools, curated evidence, compact evidence links, verification records, deduplicated observations, budget-aware context rendering) so the learned policy only decides what to search, what to keep or discard, what to verify, and when to stop. This separation simplifies the RL objective and improves generalization across retrieval domains.
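To make the idea of environment-managed state concrete, here is a minimal sketch of what such a state object could look like. It is not the released Harness-1 code: the class `SearchState`, its method names, and the character-based budget are all hypothetical, chosen only to illustrate deduplicated observations, curated evidence, verification records, and budget-aware rendering.

```python
from dataclasses import dataclass, field


@dataclass
class SearchState:
    """Hypothetical environment-side working state (illustrative only)."""

    candidates: dict[str, str] = field(default_factory=dict)  # doc_id -> snippet
    curated: list[str] = field(default_factory=list)          # doc_ids kept as evidence
    verified: dict[str, bool] = field(default_factory=dict)   # doc_id -> verification outcome

    def observe(self, results: list[tuple[str, str]]) -> list[str]:
        """Add search results to the pool, skipping doc_ids already seen.

        Returns the ids that were new, in arrival order.
        """
        new_ids = []
        for doc_id, snippet in results:
            if doc_id not in self.candidates:
                self.candidates[doc_id] = snippet
                new_ids.append(doc_id)
        return new_ids

    def keep(self, doc_id: str) -> None:
        """Promote a known candidate into the curated evidence set."""
        if doc_id in self.candidates and doc_id not in self.curated:
            self.curated.append(doc_id)

    def discard(self, doc_id: str) -> None:
        """Remove a document from the curated set; it stays in the pool."""
        if doc_id in self.curated:
            self.curated.remove(doc_id)

    def record_verification(self, doc_id: str, ok: bool) -> None:
        """Store the outcome of a verification step for a document."""
        self.verified[doc_id] = ok

    def render(self, budget_chars: int) -> str:
        """Render curated evidence first, then other candidates, within a budget.

        The budget counts characters, including one newline per line.
        """
        ordered = self.curated + [d for d in self.candidates if d not in self.curated]
        lines, used = [], 0
        for doc_id in ordered:
            tag = "kept" if doc_id in self.curated else "cand"
            line = f"[{tag}] {doc_id}: {self.candidates[doc_id]}"
            if used + len(line) + 1 > budget_chars:
                break
            lines.append(line)
            used += len(line) + 1
        return "\n".join(lines)
```

In a design like this, the policy would only emit decisions (search, keep, discard, verify, stop), and the environment would apply them to the state and hand back the rendered view on each step.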
Key Findings
- Empirical gains: on eight retrieval benchmarks (web, finance, patents, multi-hop QA), Harness-1 reaches a 0.730 average curated recall and outperforms the next strongest open search subagent by 11.4 points, while remaining competitive with much larger frontier searchers. So what: explicit state externalization yields large practical gains without scaling model size.
- Transfer robustness: performance stays strong on held-out transfer benchmarks, indicating the learned retrieval behavior generalizes beyond in-domain data. So what: RL over explicit search state helps find policies that transfer to new corpora and tasks.
- Practical release: code and evaluation tooling (BrowseComp+ evaluation runners) are published; a Hugging Face checkpoint (pat-jj/harness-1) and vLLM-compatible inference paths are provided, enabling reproduction and local serving.
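The headline metric above is curated recall. This summary does not define it, so the sketch below rests on an assumption: that curated recall is the fraction of gold-relevant documents that end up in the agent's final curated evidence set, averaged over queries. The function names are hypothetical and are not taken from the released evaluation tooling.

```python
def curated_recall(curated_ids, relevant_ids):
    """Fraction of gold-relevant documents present in the curated set.

    Assumed definition; the released evaluation code may differ.
    """
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    return len(relevant & set(curated_ids)) / len(relevant)


def mean_curated_recall(per_query):
    """Average curated recall over (curated_ids, relevant_ids) pairs."""
    scores = [curated_recall(curated, relevant) for curated, relevant in per_query]
    if not scores:
        raise ValueError("per_query must not be empty")
    return sum(scores) / len(scores)
```

Under this definition, extra documents in the curated set do not lower the score, so a measure like this would normally be read alongside the size of the curated set.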
Who it's for and trade-offs
Great fit if you are a retrieval researcher or engineer who wants to evaluate or serve a learned search subagent that separates state management from policy logic, especially when you can run GPU-backed vLLM inference and build compatible retrieval indexes (Chroma or similar). Look elsewhere if you need an out-of-the-box, zero-GPU service: reproducing full BrowseComp+ evaluations and RL training requires nontrivial compute, access to the released HF checkpoint, and dataset/index setup. The harness design also assumes an environment that can maintain and render externalized state, so deployments that cannot host a stateful environment will see limited benefit.
Where it fits
Compared with transcript-based RL or vanilla RAG-style retrieval, Harness-1 treats bookkeeping (what’s been seen, curated evidence, verification records) as environment-managed artifacts. That lets the policy spend its capacity on decision quality instead of memorizing or re-deriving recoverable facts. Practically, it sits between classical retrieval+rerank pipelines and monolithic LLM browsing agents: you still need a retrieval backend and index, but the learned agent can make higher-level curation and verification choices that improve end-to-end curated recall.
Practical notes
The repository documents vLLM-based local inference, evaluation scripts for BrowseComp+, and model-export helpers. Reproducing full experiments requires a CUDA GPU, vLLM support, and the Hugging Face checkpoint; evaluation on some corpora additionally requires constructing a Chroma collection that matches the provided qrels.
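Before running an evaluation, it can help to confirm that the index actually contains the documents the qrels refer to. The sketch below is a hypothetical pre-flight check, not part of the released tooling: it assumes qrels are already loaded as a mapping from query id to `{doc_id: relevance}` and that the ids stored in the collection have been collected into an iterable. How those ids are read from Chroma or another backend is left out.

```python
def missing_from_index(qrels, indexed_ids):
    """Return, per query, the relevant doc_ids that are absent from the index.

    qrels: dict mapping query_id -> {doc_id: relevance} (assumed layout).
    indexed_ids: iterable of doc_ids present in the retrieval collection.
    Only judgments with relevance > 0 are checked.
    """
    indexed = set(indexed_ids)
    missing = {}
    for query_id, judgments in qrels.items():
        absent = sorted(
            doc_id
            for doc_id, relevance in judgments.items()
            if relevance > 0 and doc_id not in indexed
        )
        if absent:
            missing[query_id] = absent
    return missing
```

An empty result means every relevant document in the qrels is retrievable in principle. A non-empty one points to a collection that was built from a different corpus snapshot or with a different id scheme.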
