Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b
Most search agents are trained as policies over a growing transcript. The model decides how to search. It must also remember what it saw, which evidence matters, and which claims it checked. A team of researchers from University of Illinois Urbana-Champaign, UC Berkeley, and Chroma argues this asks too much. Reinforcement learning ends up optimizing both search decisions and routine bookkeeping at once.
Their answer is Harness-1, a 20B retrieval subagent built on gpt-oss-20b. It was trained with reinforcement learning inside a stateful search harness. The harness holds the bookkeeping. The policy keeps the semantic decisions. The weights and harness code are publicly released.

What is Harness-1 Actually
Harness-1 produces a ranked set of documents for a downstream answering model. It does not answer questions itself. It runs inside a state-machine harness centered on a per-episode WORKINGMEMORY.
Each turn works as a loop. The harness renders compact search state along with recent actions. The model emits one structured action. The harness executes it, updates state, and renders the next observation.
The Stateful Harness: What Moves Out of the Policy
The research team calls its principle stateful cognitive offloading. The policy decides what to search, curate, and verify, and when to stop. The harness maintains the recoverable state around those decisions.
That state includes several pieces. A candidate pool holds compressed, deduplicated documents. An importance-tagged curated set is the final output, capped at 30 documents. Tags take four values: very_high, high, fair, or low. A full-text store keeps every retrieved chunk outside the prompt.
An evidence graph adds structure. A regex extractor scans each chunk for proper nouns, years, and dates. The harness then renders frequent entities, bridge documents, and singletons. Bridge documents contain two or more frequent entities. Singletons appear in one document and suggest follow-up leads.
The policy works through eight tools. These are fan_out_search, search_corpus, grep_corpus, read_document, review_docs, curate, verify, and end_search. Search outputs are compressed with sentence-BM25, keeping the top four sentences. Two-level deduplication removes repeats by chunk ID and content fingerprint.
One design choice addresses cold starts. The first successful search auto-seeds the curated set with eight reranked results at fair importance. The policy then promotes strong documents and removes weak ones. This turns the task from building from scratch into refinement.
The research team names three requirements for a trainable harness. These are warm-started curation, compact derived-state rendering, and diversity-preserving incentives. Harness-1 implements all three.
How It is Trained
Training splits along the same line as the harness. Supervised fine-tuning teaches the model to operate the interface. Reinforcement learning improves search decisions over the maintained state.
A single teacher, GPT-5.4, runs live inside the full harness. After filtering, 899 trajectories remain for SFT. The model uses LoRA at rank 32 for three epochs. The step-550 checkpoint initializes RL.
RL uses on-policy CISPO with a 40-turn cap and terminal-only reward. It trains only on SEC queries. Groups with identical rewards are dropped from the gradient. Training ran on Tinker.
The reward separates discovery from selection. It also adds a tool-diversity bonus. Without that bonus, the agent collapses to repeated search. Curated recall then plateaus near 0.53. With the bonus, diversity stabilizes and recall reaches about 0.60.
The Benchmark Case
Harness-1 was evaluated on eight benchmarks spanning web, finance, patents, and multi-hop QA. The main metric is curated recall: coverage of relevant documents in the final set. Trajectory recall counts evidence encountered anywhere in the episode.
Harness-1 reaches 0.730 average curated recall. That beats the next open subagent, Tongyi DeepResearch 30B, by 11.4 points. Among the frontier searchers tested, only Opus-4.6 scores higher on average.
The transfer pattern is the clearest signal of the mechanism. SFT used four benchmark families; RL used only SEC. On those source-family tasks, Harness-1 gained 7.9 points over the closest open baseline. On four held-out benchmarks, it gained 17.0 points. That is a 2.2x larger gain on tasks furthest from training data.
Ablations support the harness claim. Disabling all harness mechanisms drops Recall by 12.2 percent relative on BrowseComp+. The trained policy keeps searching but cannot rank what it sees.


Use Cases
The method targets evidence-seeking retrieval where documents support an answer. Several workflows fit this shape.
One is literature and patent review. The evidence graph and curated set help organize many sources. Another is financial-filing analysis. The SEC case study recovers an exact executive-transition date across multiple 8-Ks.
A third is multi-hop fact-checking. The fan_out_search and verify tools resolve ambiguous entities before committing. A fourth is modular RAG. The curated set feeds a frozen generator, and better sets yield higher answer accuracy.
Strengths and Weaknesses
Strengths
Highest average curated recall among the open models tested, and behind only Opus-4.6 overall.
Gains hold on held-out benchmarks, suggesting domain-general search operations.
Trained on 4,352 unique items, far fewer than several baselines.
Open checkpoint and harness code, servable with common runtimes.
Weaknesses
The evidence graph uses regex extraction, not full entity linking.
The verify tool is an LLM proxy that can err on ambiguous claims.
Sentence-BM25 compression may drop context tied to discourse structure.
The research team reports point estimates without full confidence intervals.
Key Takeaways
Harness-1 is a 20B search agent that moves search bookkeeping into the environment, leaving semantic decisions to the policy.
It hits 0.730 average curated recall across eight benchmarks, beating the next open subagent by 11.4 points.
Among the searchers tested, only Opus-4.6 scores higher on average curated recall.
Gains are largest on held-out benchmarks (+17.0 vs +7.9 points), suggesting the learned search operations transfer.
Weights and harness code are public, servable via vLLM, SGLang, or Transformers.
Marktechpost’s Visual Explainer
Stateful Search Agents
1 / 7
Check out the Paper, Model weights and GitHub Repo. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.
Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us

