Long-horizon search agents accumulate many retrieved observations across repeated tool calls, forcing a trade-off between keeping more context tokens and leaving room for additional interaction turns. This paper reframes simple observation masking as a regime-dependent intervention: masking helps in some retriever×model regimes but actively harms accuracy in others, producing an asymmetric inverted-U of gains.
Key Findings
- Regime map: the accuracy gain from masking is not monotonic. It plateaus with weak retrievers, peaks when a strong retriever meets a mid-capacity model, and collapses when the model is saturated. So what? You can't assume masking always helps; its value depends on both the retriever's recall and the model's ability to filter irrelevant context implicitly.
- Mechanism (token-for-turn trade-off): masking frees token budget for additional turns at the cost of removing potentially useful evidence. So what? Gains arise when extra turns convert failures into successes; losses occur when removed evidence was decisive.
- Robust sweep: results hold across multiple agent backbones (4B–284B parameters), three retrievers, and both offline and live-web search benchmarks. So what? The effect is broad, not an artifact of a single model or dataset.
- Practical artifact: the authors release their scaffold and trajectories to enable reproducible follow-ups. So what? You can reproduce the regime maps and test alternative context-management heuristics.
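
To make the token-for-turn trade-off concrete, here is a minimal Python sketch of observation masking. It is an illustration under stated assumptions, not the authors' released scaffold: the `Message` type, the placeholder string, the `keep_last` parameter, and the whitespace-based token count are all hypothetical stand-ins. The sketch replaces older tool observations with a short placeholder and reports how many tokens that frees for further turns.

```python
from dataclasses import dataclass, replace
from typing import List

# Hypothetical placeholder text; the summarized paper's exact wording is not given.
PLACEHOLDER = "[observation masked]"


@dataclass(frozen=True)
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str


def count_tokens(text: str) -> int:
    """Crude whitespace count (assumption); a real agent would use the model's tokenizer."""
    return len(text.split())


def context_tokens(history: List[Message]) -> int:
    """Total token count of all messages in the history."""
    return sum(count_tokens(m.content) for m in history)


def mask_observations(history: List[Message], keep_last: int = 1) -> List[Message]:
    """Return a copy of `history` in which every tool observation except the
    newest `keep_last` ones has its content replaced by PLACEHOLDER."""
    tool_idx = [i for i, m in enumerate(history) if m.role == "tool"]
    # Guard keep_last == 0: tool_idx[-0:] would select the whole list.
    keep = set(tool_idx[-keep_last:]) if keep_last > 0 else set()
    return [
        replace(m, content=PLACEHOLDER) if m.role == "tool" and i not in keep else m
        for i, m in enumerate(history)
    ]


# Toy trajectory with two retrieved observations of 500 and 400 "tokens".
history = [
    Message("user", "Who founded the lab that built X?"),
    Message("assistant", "search: X lab"),
    Message("tool", "doc " * 500),
    Message("assistant", "search: lab founder"),
    Message("tool", "doc " * 400),
]

masked = mask_observations(history, keep_last=1)
freed = context_tokens(history) - context_tokens(masked)
print(f"tokens freed for additional turns: {freed}")
```

Whether the freed budget pays off is exactly the regime question above: the extra turns help only if they turn failures into successes, and the masked observation may have contained the decisive evidence.
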
Who it's for and tradeoffs
Great fit if you design or evaluate agentic retrieval systems and need principled guidance on context management, especially when tuning retrievers and choosing model sizes for long-horizon search. Look elsewhere if your model is already saturated (a very large model paired with a very high-recall retriever) or your retriever is too weak; in those regimes masking is unlikely to help and can reduce accuracy. Also note that masking is a lightweight, heuristic intervention: the paper tells you when pruning pays off, but masking does not replace improvements to retrieval or model reasoning.
Where it fits
This work situates context masking alongside RAG-style retrieval and other memory/pruning strategies: instead of proposing a new retriever or learning-to-write memory, it provides an empirical and mechanistic guide for when a minimal masking heuristic is beneficial. Use it to decide whether to invest effort in smarter retrievers, larger models, or context-management policies for a given application.
