Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b

Most search agents are trained as policies over a growing transcript. The model decides how to search. It must also remember what it saw, which evidence matters, and which claims it checked. A team of researchers from University of Illinois Urbana-Champaign, UC Berkeley, and Chroma argues this asks too much. Reinforcement learning ends up optimizing both search decisions and routine bookkeeping at once.

Their answer is Harness-1, a 20B retrieval subagent built on gpt-oss-20b. It was trained with reinforcement learning inside a stateful search harness. The harness holds the bookkeeping. The policy keeps the semantic decisions. The weights and harness code are publicly released.

What Harness-1 Actually Is

Harness-1 produces a ranked set of documents for a downstream answering model. It does not answer questions itself. It runs inside a state-machine harness centered on a per-episode WorkingMemory.

Each turn works as a loop. The harness renders compact search state along with recent actions. The model emits one structured action. The harness executes it, updates state, and renders the next observation.
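The turn loop can be sketched in a few lines of Python. This is a minimal illustration, not the released harness: `ToyHarness`, `run_episode`, and `scripted_policy` are hypothetical names, the substring search stands in for real retrieval, and the scripted policy stands in for the model. The 40-turn cap is borrowed from the RL setup the article describes.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyHarness:
    """Hypothetical stand-in for the stateful harness (not the released code)."""
    corpus: dict[str, str]
    curated: list[str] = field(default_factory=list)
    recent_actions: list[dict] = field(default_factory=list)
    done: bool = False

    def render(self) -> str:
        # Compact state plus the last few actions, instead of a full transcript.
        recent = [a["tool"] for a in self.recent_actions[-3:]]
        return f"curated={self.curated} recent={recent}"

    def execute(self, action: dict) -> None:
        # Apply one structured action, then record it.
        if action["tool"] == "search_corpus":
            query = action["query"].lower()
            for doc_id, text in self.corpus.items():
                if query in text.lower() and doc_id not in self.curated:
                    self.curated.append(doc_id)
        elif action["tool"] == "end_search":
            self.done = True
        self.recent_actions.append(action)


def run_episode(harness: ToyHarness,
                policy: Callable[[str], dict],
                max_turns: int = 40) -> list[str]:
    for _ in range(max_turns):
        observation = harness.render()   # harness renders state
        action = policy(observation)     # model emits one structured action
        harness.execute(action)          # harness executes and updates state
        if harness.done:
            break
    return harness.curated


def scripted_policy(observation: str) -> dict:
    # Stand-in for the model: search once, then stop.
    if "search_corpus" in observation:
        return {"tool": "end_search"}
    return {"tool": "search_corpus", "query": "8-K"}


corpus = {"d1": "Form 8-K filed on departure of CEO", "d2": "Annual report"}
result = run_episode(ToyHarness(corpus), scripted_policy)
```

The point of the sketch is that the policy only ever sees `render()` output, so everything it needs to remember has to live in harness state.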

The Stateful Harness: What Moves Out of the Policy

The research team calls its principle stateful cognitive offloading. The policy decides what to search, curate, and verify, and when to stop. The harness maintains the recoverable state around those decisions.

That state includes several pieces. A candidate pool holds compressed, deduplicated documents. An importance-tagged curated set is the final output, capped at 30 documents. Tags take four values: very_high, high, fair, or low. A full-text store keeps every retrieved chunk outside the prompt.
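One way to picture that state is as a small data structure. The sketch below is an assumption-laden illustration: the class and method names are hypothetical, and rejecting additions once the cap is reached is a guess about overflow handling. Only the four tag values and the cap of 30 come from the article.

```python
from dataclasses import dataclass, field

IMPORTANCE = ("very_high", "high", "fair", "low")  # tag values from the article
MAX_CURATED = 30                                   # cap from the article


@dataclass
class WorkingMemory:
    """Hypothetical per-episode state; field names are illustrative."""
    candidate_pool: dict[str, str] = field(default_factory=dict)  # doc_id -> compressed text
    curated: dict[str, str] = field(default_factory=dict)         # doc_id -> importance tag
    full_text: dict[str, str] = field(default_factory=dict)       # doc_id -> full chunk

    def add_candidate(self, doc_id: str, full: str, compressed: str) -> None:
        # Full text stays outside the prompt; only the compressed form is rendered.
        self.full_text[doc_id] = full
        self.candidate_pool.setdefault(doc_id, compressed)

    def curate(self, doc_id: str, importance: str) -> bool:
        if importance not in IMPORTANCE:
            raise ValueError(f"unknown importance tag: {importance}")
        # Assumption: new documents are rejected once the cap is reached.
        if doc_id not in self.curated and len(self.curated) >= MAX_CURATED:
            return False
        self.curated[doc_id] = importance
        return True

    def ranked(self) -> list[str]:
        # Order the final output by tag, strongest first.
        return sorted(self.curated, key=lambda d: IMPORTANCE.index(self.curated[d]))


memory = WorkingMemory()
memory.add_candidate("d1", full="long filing text", compressed="short summary")
memory.curate("d1", "fair")
memory.curate("d2", "very_high")
```

Here `memory.ranked()` puts `d2` ahead of `d1`, because `very_high` outranks `fair`.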

An evidence graph adds structure. A regex extractor scans each chunk for proper nouns, years, and dates. The harness then renders frequent entities, bridge documents, and singletons. Bridge documents contain two or more frequent entities. Singletons appear in one document and suggest follow-up leads.
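A rough sketch of how such a graph could be derived follows. The regular expression, the function name `build_graph`, and the threshold for "frequent" (appearing in at least two documents) are all assumptions made for illustration; the article does not give the harness's actual pattern or cutoff.

```python
import re
from collections import defaultdict

# Assumed pattern: runs of capitalized words, or four-digit years (19xx / 20xx).
ENTITY_RE = re.compile(r"\b(?:[A-Z][a-z]+(?:\s[A-Z][a-z]+)*|(?:19|20)\d{2})\b")


def build_graph(chunks: dict[str, str], min_docs: int = 2):
    """Return (frequent entities, bridge documents, singleton entities)."""
    docs_by_entity: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in chunks.items():
        for entity in ENTITY_RE.findall(text):
            docs_by_entity[entity].add(doc_id)

    # Assumption: "frequent" means the entity appears in min_docs or more documents.
    frequent = {e for e, docs in docs_by_entity.items() if len(docs) >= min_docs}
    # Singletons appear in exactly one document.
    singletons = {e for e, docs in docs_by_entity.items() if len(docs) == 1}

    # Bridge documents contain two or more frequent entities.
    frequent_by_doc: dict[str, set[str]] = defaultdict(set)
    for entity in frequent:
        for doc_id in docs_by_entity[entity]:
            frequent_by_doc[doc_id].add(entity)
    bridges = {d for d, ents in frequent_by_doc.items() if len(ents) >= 2}
    return frequent, bridges, singletons


chunks = {
    "a": "Acme Corp named Jane Doe as chief executive in 2021.",
    "b": "Jane Doe previously led Globex.",
    "c": "Acme Corp reported earnings in 2021 and 2022.",
}
frequent, bridges, singletons = build_graph(chunks)
```

A pattern this simple also picks up capitalized sentence openers such as "The", which illustrates why regex extraction is coarser than full entity linking.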

The policy works through eight tools. These are fan_out_search, search_corpus, grep_corpus, read_document, review_docs, curate, verify, and end_search. Search outputs are compressed with sentence-BM25, keeping the top four sentences. Two-level deduplication removes repeats by chunk ID and content fingerprint.
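The compression and deduplication steps can be approximated as below. This is a sketch under stated assumptions: the BM25 parameters, the sentence splitter, and the SHA-1 fingerprint over normalized tokens are illustrative choices, and `compress`, `fingerprint`, and `dedupe` are hypothetical names rather than functions from the released code.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def compress(query: str, chunk: str, top_k: int = 4,
             k1: float = 1.5, b: float = 0.75) -> str:
    """Keep the top_k sentences by BM25 score, treating each sentence as a document."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]
    if len(sentences) <= top_k:
        return " ".join(sentences)
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)
    avg_len = (sum(len(t) for t in tokenized) / n) or 1.0
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))

    def score(tokens: list[str]) -> float:
        tf = Counter(tokens)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    best = sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)[:top_k]
    return " ".join(sentences[i] for i in sorted(best))  # restore original order


def fingerprint(text: str) -> str:
    # Assumed fingerprint: hash of lowercased alphanumeric tokens.
    return hashlib.sha1(" ".join(tokenize(text)).encode("utf-8")).hexdigest()


def dedupe(results: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Two-level dedup: drop repeats by chunk ID, then by content fingerprint."""
    seen_ids: set[str] = set()
    seen_prints: set[str] = set()
    kept = []
    for chunk_id, text in results:
        fp = fingerprint(text)
        if chunk_id in seen_ids or fp in seen_prints:
            continue
        seen_ids.add(chunk_id)
        seen_prints.add(fp)
        kept.append((chunk_id, text))
    return kept


chunk = ("The company sells widgets. The CEO resigned on March 3. "
         "Revenue grew last year. The board named an interim CEO. "
         "Offices are in Ohio. Weather was mild.")
summary = compress("CEO resigned", chunk)
unique = dedupe([("c1", "Hello world."), ("c1", "Different"),
                 ("c2", "hello  WORLD"), ("c3", "New text")])
```

The second level matters because the same passage can come back under different chunk IDs; in the example, `c2` is dropped even though its ID is new.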

One design choice addresses cold starts. The first successful search auto-seeds the curated set with eight reranked results at fair importance. The policy then promotes strong documents and removes weak ones. This turns the task from building from scratch into refinement.
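The warm start might look like the following sketch. `CuratedSet` and `on_search_results` are hypothetical names; the seed size of eight and the fair tag come from the article, while the one-time flag is an assumption about how "first successful search" is tracked.

```python
SEED_SIZE = 8             # from the article
SEED_IMPORTANCE = "fair"  # from the article


class CuratedSet:
    """Hypothetical curated set that seeds itself once."""

    def __init__(self) -> None:
        self.tags: dict[str, str] = {}
        self.seeded = False

    def on_search_results(self, reranked_ids: list[str]) -> bool:
        # Only the first search that returns anything seeds the set.
        if self.seeded or not reranked_ids:
            return False
        for doc_id in reranked_ids[:SEED_SIZE]:
            self.tags[doc_id] = SEED_IMPORTANCE
        self.seeded = True
        return True


curated = CuratedSet()
first = curated.on_search_results([f"doc{i}" for i in range(12)])

# The policy then refines instead of building from scratch.
curated.tags["doc0"] = "very_high"  # promote a strong document
del curated.tags["doc7"]            # remove a weak one

second = curated.on_search_results(["doc20"])  # later searches do not re-seed
```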

The research team names three requirements for a trainable harness. These are warm-started curation, compact derived-state rendering, and diversity-preserving incentives. Harness-1 implements all three.

How It Is Trained

Training splits along the same line as the harness. Supervised fine-tuning teaches the model to operate the interface. Reinforcement learning improves search decisions over the maintained state.

A single teacher, GPT-5.4, runs live inside the full harness. After filtering, 899 trajectories remain for SFT. The model uses LoRA at rank 32 for three epochs. The step-550 checkpoint initializes RL.

RL uses on-policy CISPO with a 40-turn cap and terminal-only reward. It trains only on SEC queries. Groups with identical rewards are dropped from the gradient. Training ran on Tinker.
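The group filter is easy to illustrate. In the sketch below, each group holds the terminal rewards of several rollouts for the same query. Dropping identical-reward groups follows the article; the mean-and-standard-deviation normalization is a common group-relative choice and an assumption here, and the CISPO loss itself is not shown. `group_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev


def group_advantages(groups: list[list[float]], eps: float = 1e-6) -> list[list[float]]:
    """Drop groups with identical rewards; normalize the rest within each group."""
    kept = []
    for rewards in groups:
        if max(rewards) - min(rewards) < eps:
            continue  # identical rewards carry no relative signal
        mu = mean(rewards)
        sigma = pstdev(rewards)
        kept.append([(r - mu) / (sigma + eps) for r in rewards])
    return kept


groups = [
    [0.5, 0.5, 0.5],  # dropped: every rollout scored the same
    [0.2, 0.8],       # kept: rollouts disagree
]
advantages = group_advantages(groups)
```

A group where every rollout earns the same reward would produce all-zero advantages anyway, so filtering it simply keeps uninformative samples out of the batch.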

The reward separates discovery from selection. It also adds a tool-diversity bonus. Without that bonus, the agent collapses to repeated search. Curated recall then plateaus near 0.53. With the bonus, diversity stabilizes and recall reaches about 0.60.
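A toy version of such a reward shows why the bonus changes behavior. Everything specific here is hypothetical: the weights, the linear form, and the definition of diversity as the fraction of the eight tools used at least once. The article only states that the reward separates discovery from selection and adds a tool-diversity bonus.

```python
N_TOOLS = 8  # number of tools named in the article


def terminal_reward(discovery: float, selection: float, tools_used: list[str],
                    w_discovery: float = 0.3, w_selection: float = 0.6,
                    w_diversity: float = 0.1) -> float:
    """Hypothetical terminal reward.

    discovery: recall over everything encountered in the episode
    selection: recall over the final curated set
    """
    diversity = len(set(tools_used)) / N_TOOLS
    return w_discovery * discovery + w_selection * selection + w_diversity * diversity


# Same retrieval quality, different tool usage.
repeated = terminal_reward(1.0, 0.5, ["search_corpus"] * 5)
diverse = terminal_reward(1.0, 0.5, ["search_corpus", "grep_corpus",
                                     "read_document", "verify", "search_corpus"])
```

With equal recall, the diverse episode scores higher, so the gradient no longer favors collapsing to one repeated action.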

The Benchmark Case

Harness-1 was evaluated on eight benchmarks spanning web, finance, patents, and multi-hop QA. The main metric is curated recall: coverage of relevant documents in the final set. Trajectory recall counts evidence encountered anywhere in the episode.
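The two metrics differ only in which set of documents is scored. A minimal sketch, with made-up document IDs and a hypothetical `recall` helper:

```python
def recall(relevant: set[str], retrieved: set[str]) -> float:
    """Fraction of relevant documents present in the retrieved set."""
    if not relevant:
        return 0.0
    return len(relevant & retrieved) / len(relevant)


relevant = {"d1", "d2", "d3", "d4"}
encountered = {"d1", "d2", "d3", "d9", "d10"}  # seen anywhere in the episode
curated = {"d1", "d2", "d9"}                   # final curated set

trajectory_recall = recall(relevant, encountered)
curated_recall = recall(relevant, curated)
```

Here the agent found three of four relevant documents but kept only two, so trajectory recall is 0.75 and curated recall is 0.5. A wide gap between the two suggests the agent finds evidence but fails to select it.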

| Model | Type | Avg Curated Recall | Avg Trajectory Recall |
| --- | --- | --- | --- |
| Harness-1 (20B) | Open small | 0.730 | 0.807 |
| Tongyi DeepResearch 30B | Open small | 0.616 | 0.673 |
| Context-1 (20B) | Open small | 0.603 | 0.756 |
| Search-R1 (32B) | Open small | 0.289 | 0.289 |
| GPT-OSS-20B | Open small | 0.262 | 0.590 |
| Qwen3 (32B) | Open small | 0.216 | 0.446 |
| Opus-4.6 | Frontier | 0.764 | 0.794 |
| GPT-5.4 | Frontier | 0.709 | 0.752 |
| Sonnet-4.6 | Frontier | 0.688 | 0.725 |
| Kimi-K2.5 | Frontier | 0.647 | 0.794 |
| GPT-OSS-120B | Frontier | 0.496 | 0.769 |

Averages across eight benchmarks, from Figure 1 of the paper. Frontier models run as zero-shot retrievers under the Context-1 harness.

Harness-1 reaches 0.730 average curated recall. That beats the next open subagent, Tongyi DeepResearch 30B, by 11.4 points. Among the frontier searchers tested, only Opus-4.6 scores higher on average.

The transfer pattern is the clearest signal of the mechanism. SFT used four benchmark families; RL used only SEC. On those source-family tasks, Harness-1 gained 7.9 points over the closest open baseline. On four held-out benchmarks, it gained 17.0 points. The held-out gain is roughly 2.2 times the source-family gain (17.0 / 7.9 ≈ 2.15), and it lands on the tasks furthest from the training data.
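The margins quoted above can be checked directly from the reported averages. The snippet below only redoes the arithmetic on numbers stated in the article; the variable names are illustrative.

```python
# Average curated recall as reported in the article: name -> (type, value).
curated_recall = {
    "Harness-1 (20B)": ("open", 0.730),
    "Tongyi DeepResearch 30B": ("open", 0.616),
    "Context-1 (20B)": ("open", 0.603),
    "Search-R1 (32B)": ("open", 0.289),
    "GPT-OSS-20B": ("open", 0.262),
    "Qwen3 (32B)": ("open", 0.216),
    "Opus-4.6": ("frontier", 0.764),
    "GPT-5.4": ("frontier", 0.709),
    "Sonnet-4.6": ("frontier", 0.688),
    "Kimi-K2.5": ("frontier", 0.647),
    "GPT-OSS-120B": ("frontier", 0.496),
}

harness = curated_recall["Harness-1 (20B)"][1]

# Best open model other than Harness-1.
next_open = max(value for name, (kind, value) in curated_recall.items()
                if kind == "open" and name != "Harness-1 (20B)")
gap_points = round((harness - next_open) * 100, 1)

# Models that score higher than Harness-1 on average curated recall.
higher = [name for name, (_, value) in curated_recall.items() if value > harness]

# Held-out gain relative to source-family gain, both in points.
transfer_ratio = round(17.0 / 7.9, 2)
```

The gap works out to 11.4 points, the only model above Harness-1 is Opus-4.6, and the transfer ratio is 2.15, which the article rounds to 2.2x.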

Ablations support the harness claim. Disabling all harness mechanisms cuts recall on BrowseComp+ by 12.2 percent in relative terms. The trained policy keeps searching but cannot rank what it sees.

Use Cases

The method targets evidence-seeking retrieval where documents support an answer. Several workflows fit this shape.

One is literature and patent review. The evidence graph and curated set help organize many sources. Another is financial-filing analysis. The SEC case study recovers an exact executive-transition date across multiple 8-Ks.

A third is multi-hop fact-checking. The fan_out_search and verify tools resolve ambiguous entities before committing. A fourth is modular RAG. The curated set feeds a frozen generator, and better sets yield higher answer accuracy.

Strengths and Weaknesses

Strengths

Highest average curated recall among the open models tested, and behind only Opus-4.6 overall.

Gains hold on held-out benchmarks, suggesting domain-general search operations.

Trained on 4,352 unique items, far fewer than several baselines.

Open checkpoint and harness code, servable with common runtimes.

Weaknesses

The evidence graph uses regex extraction, not full entity linking.

The verify tool is an LLM proxy that can err on ambiguous claims.

Sentence-BM25 compression may drop context tied to discourse structure.

The research team reports point estimates without full confidence intervals.

Key Takeaways

Harness-1 is a 20B search agent that moves search bookkeeping into the environment, leaving semantic decisions to the policy.

It hits 0.730 average curated recall across eight benchmarks, beating the next open subagent by 11.4 points.

Among the searchers tested, only Opus-4.6 scores higher on average curated recall.

Gains are largest on held-out benchmarks (+17.0 vs +7.9 points), suggesting the learned search operations transfer.

Weights and harness code are public, servable via vLLM, SGLang, or Transformers.

Marktechpost’s Visual Explainer


Harness-1: a 20B search agent with a stateful harness

A retrieval subagent trained with reinforcement learning inside a search harness that holds the bookkeeping.

20B · gpt-oss-20b base
UIUC · UC Berkeley · Chroma
arXiv:2606.02373
Open weights & code

The Core Idea

Split the work between policy and harness

Most search agents pack search decisions and routine bookkeeping into one growing transcript. Harness-1 separates the two. The paper calls this stateful cognitive offloading.

Policy decides

What to search
Which documents to keep
What claims to verify
When to stop

Harness maintains

Candidate pool
Curated evidence
Verification records
Context budget

Inside the Harness

Environment-side working memory

Candidate pool — compressed, deduplicated documents
Curated set — importance-tagged, capped at 30 (very_high / high / fair / low)
Evidence graph — entities, bridges, and singletons via regex extraction
Verification cache — claim to document to yes/no verdict
Full-text store — every retrieved chunk kept outside the prompt
Compression — sentence-BM25 keeps the top four sentences

Policy Actions

Eight tools edit the state

The first successful search auto-seeds the curated set with eight reranked documents at fair importance. The policy then promotes strong documents and removes weak ones.

Training

SFT to operate the interface, RL to search

SFT: GPT-5.4 teacher inside the harness · 899 trajectories · LoRA rank 32 · step-550 checkpoint

RL: on-policy CISPO · SEC queries only · 40-turn cap · terminal reward · trained on Tinker

Data scale: 4,352 unique training items (899 SFT + 3,453 RL)

Three trainability requirements: warm-started curation, compact derived-state rendering, and diversity-preserving incentives.

Results

What the numbers show

0.730 average curated recall across eight benchmarks

+11.4 pts over the next open subagent, Tongyi DeepResearch 30B

Among the searchers tested, only Opus-4.6 scores higher on average

Transfer: +17.0 on held-out vs +7.9 on source-family (2.2x gap)

Ablation: removing all harness mechanisms drops Recall 12.2% relative

Get Started

Run it yourself

Serve: vLLM, SGLang, or Transformers

Checkpoint: pat-jj/harness-1 (Hugging Face, 21B params, BF16)

Code: github.com/pat-jj/harness-1

Paper: arXiv:2606.02373

Harness-1 returns a curated set of documents for a downstream answering model. It does not answer questions itself.
