AI Systems · Harness Engineering

The 6x gap lives in your code,
not your model.

Researchers at Stanford and MIT built a system that automatically searches for better harness code around language models. The harness is the code that decides what to retrieve, what to store, and what to present to the model at each step. Fixing the model and varying only the harness produced a 6x performance gap on the same benchmark. The binding constraint for many tasks turned out not to be model capability at all.

Core concept

Harness optimization: automatically searching over the code surrounding a fixed model, including retrieval pipelines, context management, memory systems, and output formatting, to find the configuration that maximizes performance.

First surfaced in Tandemly Briefing — 2026-05-08.


01 The problem

The model gets all the attention.
The code around it doesn't.

When an LLM application underperforms, the default response is to swap the model. But the harness, the code that wraps every inference call, often has more influence over the output than the weights themselves.

Think about what happens when you deploy a language model to answer questions from a knowledge base. The model sees a prompt. That prompt was assembled by code. Some code decided which documents to retrieve, how many to include, in what order to present them, whether to truncate long ones, and how to format citations. That code is the harness. It runs before and after every model call, shaping what the model sees and what it returns.
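To make that definition concrete, here is a minimal sketch of such a harness in Python. Every name in it (`Doc`, `build_prompt`, the `max_docs` and `max_chars` parameters, the citation format) is a hypothetical choice for illustration and does not come from the paper. The point is only that each commented line is a decision the model never gets to make.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    text: str
    score: float  # relevance score from some upstream retriever (assumed)


def build_prompt(question: str, docs: list[Doc],
                 max_docs: int = 3, max_chars: int = 200) -> str:
    """Assemble the prompt the model will actually see."""
    # Decision 1: which documents to include, and in what order
    # (here: the top `max_docs` by score, highest first).
    chosen = sorted(docs, key=lambda d: d.score, reverse=True)[:max_docs]
    # Decision 2: whether to truncate long documents (here: hard character cap).
    # Decision 3: how to format citations (here: a bracketed id prefix).
    blocks = [f"[{d.doc_id}] {d.text[:max_chars]}" for d in chosen]
    context = "\n".join(blocks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer with citations."
```

Changing any default here changes what the model sees without touching the model, which is exactly the surface a harness search explores.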

Harnesses are still designed almost entirely by hand. Engineers draw on intuition, benchmarks, and iteration to decide how to structure context, which retrieval method to use, and how to manage memory across turns. The process is reasonable, but it has no systematic optimization step. You try some configurations, pick the one that seems best, and move on.

Text optimizers like TextGrad, ProTeGi, and OPRO have tried to close this gap by iteratively improving prompts. The problem is that they compress feedback too aggressively. They condition on scores, recent solutions, or fixed summary formats. When a harness has nontrivial structure, with conditional routing, multi-stage filtering, and context construction that depends on intermediate results, that compressed feedback loses the signal needed to guide improvement. You can't fix what you can't see.

The question this paper asks

What if you treated harness design as a search problem, gave the optimizer access to the complete history of everything that happened during prior runs, and let it write and rewrite actual code rather than just tweaking prompts?

02 The experiment

A search loop with
full diagnostic access.

Meta-Harness is an outer loop that searches over harness code for LLM applications. The key design choice is what the proposer gets to read before suggesting its next candidate.

Most optimization loops hand the optimizer a score: "this candidate got 62.3, the previous one got 58.1, try again." Meta-Harness hands the optimizer everything. The source code of every prior harness candidate. The evaluation score for each. The full execution traces: what the model was shown, what it retrieved, what it returned, where the errors occurred. All of it lives in a filesystem that the proposer agent navigates using shell commands.

The proposer reads this diagnostic history, identifies patterns in what worked and what failed, and proposes a new harness. Not a new prompt. A new piece of code. The system supports conditional routing, filtering pipelines, multi-stage context construction, memory systems, and any other logic the proposer can write. The proposed harness gets evaluated on a held-out task set, results are logged to the filesystem, and the loop repeats.
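The loop described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the reference implementation: the directory layout (`candidate_000/harness.py`, `score.json`, `trace.log`) is invented for illustration, and `toy_propose` and `toy_evaluate` are hypothetical stand-ins for the proposer agent and the task evaluation.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

Harness = str  # a candidate harness, represented as source code text


def run_search(propose: Callable[[Path], Harness],
               evaluate: Callable[[Harness], tuple[float, list[str]]],
               log_dir: Path, iterations: int) -> Path:
    """Propose, evaluate, log, repeat. Returns the best candidate's directory."""
    best_dir, best_score = None, float("-inf")
    for i in range(iterations):
        source = propose(log_dir)        # the proposer may read every prior run
        score, trace = evaluate(source)  # a score plus an execution trace
        run_dir = log_dir / f"candidate_{i:03d}"
        run_dir.mkdir(parents=True)
        (run_dir / "harness.py").write_text(source)
        (run_dir / "score.json").write_text(json.dumps({"score": score}))
        (run_dir / "trace.log").write_text("\n".join(trace))
        if score > best_score:
            best_dir, best_score = run_dir, score
    return best_dir


def toy_propose(log_dir: Path) -> Harness:
    # Stand-in proposer: counts prior candidates and tries the next MAX_DOCS.
    prior = sorted(log_dir.glob("candidate_*/harness.py"))
    return f"MAX_DOCS = {len(prior) + 1}\n"


def toy_evaluate(source: Harness) -> tuple[float, list[str]]:
    # Stand-in evaluation: the score peaks when MAX_DOCS == 3.
    k = int(source.split("=")[1])
    return 1.0 - abs(k - 3) * 0.25, [f"ran with MAX_DOCS={k}"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        best = run_search(toy_propose, toy_evaluate, Path(tmp), iterations=5)
        print(best.name)  # prints candidate_002
```

The toy proposer only counts prior candidates. The design choice the article describes is that a real proposer can open any file under `log_dir`, including the traces, before writing its next candidate.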

This design uses roughly 400 times more context than the next best competing method. That sounds alarming. But the argument is that compressed feedback is what makes prior optimizers fail on structured code. The cost of more context is worth paying if it actually carries the signal needed to credit improvements to specific interventions.

Prior text optimizers

Compressed feedback. The optimizer sees the evaluation score, perhaps a few recent solutions, or a fixed-format summary. Long-range dependencies in structured code are invisible. Credit assignment fails when the winning intervention happened three iterations ago.

Meta-Harness

Full diagnostic history. The proposer reads all prior source code, all evaluation scores, and all execution traces via a filesystem. It can identify which specific decision in which iteration drove the improvement. The feedback is expensive but complete.
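The contrast between the two feedback styles can be shown on the same run records. The sketch below is illustrative only: the `Run` record and both function names are hypothetical, and Meta-Harness exposes its history as files the proposer navigates rather than as one concatenated string.

```python
from dataclasses import dataclass


@dataclass
class Run:
    source: str       # harness code for this candidate
    score: float
    trace: list[str]  # per-step execution log


def compressed_feedback(runs: list[Run], recent: int = 2) -> str:
    """Compressed view: scores for the last few candidates, nothing else."""
    indexed = list(enumerate(runs))[-recent:]
    return "\n".join(f"candidate {i}: {r.score:.1f}" for i, r in indexed)


def full_feedback(runs: list[Run]) -> str:
    """Uncompressed view: code, score, and trace for every candidate."""
    parts = []
    for i, r in enumerate(runs):
        parts.append(f"candidate {i} (score {r.score:.1f})\n"
                     f"{r.source}\n--- trace ---\n" + "\n".join(r.trace))
    return "\n\n".join(parts)
```

If candidate 0 failed because a retrieval step returned nothing, that fact appears only in its trace. The compressed view drops it; the full view keeps it available for credit assignment.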

What the discovered harnesses actually look like

The harnesses found by Meta-Harness are not simple rewrites. They contain real control flow: conditional routing that sends different query types through different retrieval paths, multi-stage filtering that removes low-confidence results before context assembly, and memory management logic that decides which prior turns are worth keeping. These are domain-specific policies, not generic templates.
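As a rough picture of what those three ingredients look like in code, here is a hand-written sketch. It is not a harness discovered by Meta-Harness; the routing rule, the confidence threshold, the turn budget, and the index names are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    text: str
    confidence: float


def route(query: str) -> str:
    """Conditional routing: pick a retrieval path from the query type."""
    # Hypothetical rule: queries containing digits go to a formula index.
    return "formula_index" if any(ch.isdigit() for ch in query) else "text_index"


def filter_hits(hits: list[Hit], min_conf: float = 0.5) -> list[Hit]:
    """Filtering stage: drop low-confidence results before context assembly."""
    return [h for h in hits if h.confidence >= min_conf]


def keep_turns(turns: list[str], budget: int = 2) -> list[str]:
    """Memory management: keep only the most recent turns within a budget."""
    return turns[-budget:] if budget > 0 else []


def assemble(query: str, hits_by_index: dict[str, list[Hit]],
             turns: list[str]) -> str:
    """Combine kept turns, filtered hits from the routed index, and the query."""
    hits = filter_hits(hits_by_index.get(route(query), []))
    return "\n".join(keep_turns(turns) + [h.text for h in hits] + [query])
```

Each function is a policy with parameters and branches, which is why a search over code can reach configurations that prompt-only tuning cannot express.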

03 Findings

Harness design creates
a 6x performance spread.

The headline result is not that Meta-Harness beats a single baseline. It's that the choice of harness around a fixed model matters enough to produce a 6x performance difference. That reframes where the optimization budget should go.

Harness-induced spread: 6x performance gap, same model, different harness

Text classification gain: +7.7 points over SOTA, 4x fewer tokens

Math reasoning gain: +4.7 points on 200 IMO-level problems

Finding 1: Online text classification

On an online text classification task where the model must manage a growing context of labeled examples, Meta-Harness improved over a state-of-the-art context management system by 7.7 points while using 4 times fewer context tokens. The discovered harness generalized to 9 out-of-distribution task variants with 73.1% average accuracy, suggesting it learned something structural about context management rather than overfitting to the training distribution.

Finding 2: Retrieval-augmented math reasoning

On 200 IMO-level (International Mathematical Olympiad) problems paired with a retrieval corpus, a single harness discovered by Meta-Harness improved accuracy by 4.7 points on average across five held-out models: from 34.1% to 38.8%. It outperformed BM25 retrieval by 1.3 points overall. The generalization to held-out models is notable. A harness found by optimizing against one model transferred to models the system had never seen, which means the discovered retrieval strategy was capturing something about the task structure rather than exploiting a particular model's quirks.
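A gain quoted in points is easy to misread as a percentage, so it helps to work the arithmetic once. The snippet below uses only the two averages quoted above. The relative figure is derived here for illustration and is not a number reported in the text.

```python
baseline, with_harness = 34.1, 38.8  # average accuracy (%) quoted above

absolute_gain = with_harness - baseline         # percentage points
relative_gain = absolute_gain / baseline * 100  # percent of the baseline

print(f"{absolute_gain:.1f} points, {relative_gain:.1f}% relative")
# prints 4.7 points, 13.8% relative
```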

Finding 3: TerminalBench-2 agentic coding

On TerminalBench-2, a benchmark of software engineering tasks requiring multiple tool calls, discovered harnesses outperformed the hand-engineered Terminus-KIRA baseline by 1.7 points. Among agents using Haiku 4.5, the Meta-Harness candidate ranked first (37.6% pass rate). Among Opus 4.6 agents, it ranked second (76.4%), trailing only ForgeCode. This is a different regime from the retrieval tasks: agentic coding requires the harness to manage tool selection, error recovery, and multi-step planning rather than just context assembly.

Finding 4: Speed of convergence

Meta-Harness converged to a good harness 10 times faster than OpenEvolve and TTT-Discover, the next best competing methods. Despite using substantially more context per iteration, the full diagnostic access appears to reduce wasted exploration. The optimizer finds the right direction faster when it can see why prior candidates failed rather than just that they failed.

Scope and limitations

The three evaluation domains (text classification, retrieval-augmented reasoning, agentic coding) were selected by the authors and may not represent the full range of LLM application types. The 400x context increase has real cost implications in production settings. The paper demonstrates that discovered harnesses generalize across models within these domains but does not characterize the failure modes of generalization.

04 Practical takeaways

Where the leverage
actually lives.

The 6x harness-induced spread is the finding that should change behavior. Before spending engineering time or budget on model upgrades, it is worth asking whether the harness has been optimized at all.

For AI builders and platform teams

Audit your harness before upgrading your model. Map every place your application code makes a decision about what to store, retrieve, or present to the model. That is your search space. A 6x performance spread from harness variation suggests that harness improvements can rival the gain from a model upgrade, and they don't require renegotiating model contracts.

For teams using RAG or context management

The 7.7-point improvement alongside a 4x token reduction in the text classification experiment suggests that a better harness can improve accuracy and reduce cost at the same time. The two objectives are not in tension when the harness is retrieving the right things rather than more things. This has direct implications for production systems where token costs are a real line item.

For teams doing hard reasoning tasks

The generalization of discovered harnesses to held-out models is the most practically useful result. A harness found on one model transferred to five models the optimizer had never seen. This suggests you may be able to discover a good retrieval or context strategy once and reuse it when you rotate models, rather than re-running the search each time. The paper does not characterize when that transfer breaks down, so re-check the harness on your own benchmark after a model change.

For ML researchers and tooling builders

This paper frames harness engineering as an optimization problem with a defined search space, a feedback signal, and a measurable objective. That framing invites a range of follow-on questions: What makes some harnesses transfer across models and others not? What search strategies work for different harness structures? The reference implementation at github.com/stanford-iris-lab/meta-harness provides a starting point for running the experiments and extending the approach.

A note on the cost tradeoff

The 400x context increase over the next best method is real. This is a research prototype, not a drop-in production tool. The case for running it is when you have a stable task, a clear benchmark, and a harness that hasn't been systematically optimized. In that setting, the upfront search cost is amortized across all production inferences using the discovered harness.
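The amortization argument can be written as a break-even calculation. The sketch below assumes the discovered harness lowers the per-call cost, as in the token-reduction result. Both function names and all dollar figures are hypothetical; none of them come from the paper.

```python
def break_even_calls(search_cost: float, old_cost_per_call: float,
                     new_cost_per_call: float) -> float:
    """Production calls needed before per-call savings repay the search."""
    saving = old_cost_per_call - new_cost_per_call
    if saving <= 0:
        # No per-call saving: the search never pays back on cost alone.
        return float("inf")
    return search_cost / saving


def amortized_cost_per_call(search_cost: float, new_cost_per_call: float,
                            calls: int) -> float:
    """Per-call cost once the one-time search is spread over `calls` inferences."""
    if calls <= 0:
        raise ValueError("calls must be positive")
    return new_cost_per_call + search_cost / calls


# Hypothetical figures in dollars, chosen only to show the shape of the tradeoff.
SEARCH_COST = 300.0
OLD_COST, NEW_COST = 0.04, 0.01

print(round(break_even_calls(SEARCH_COST, OLD_COST, NEW_COST)))
print(amortized_cost_per_call(SEARCH_COST, NEW_COST, calls=100_000))
```

When the harness improves accuracy without cutting per-call cost, the break-even is infinite on cost alone, and the search has to be justified by the quality gain instead.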

05 Further exploration

Where to go
from here.

The reference implementation is public and the paper is self-contained. Two experiments are particularly approachable for getting started.

Read the paper

Lee, Y., Nair, R., Zhang, Q., Lee, K., Khattab, O., & Finn, C. (2026). Meta-Harness: End-to-End Optimization of Model Harnesses. Stanford University & MIT. arXiv:2603.28052.

Clone the reference implementation

The code lives at github.com/stanford-iris-lab/meta-harness. Two reference experiments are included: text classification with memory-system search, and TerminalBench-2 with scaffold evolution. The repository uses Claude Code as the proposer agent and includes an ONBOARDING.md for adapting to new domains.

Map the harness in your own system

Before running any optimization, identify every decision point where code shapes what the model sees. Retrieval method and ranking, context length and truncation strategy, memory across turns, output formatting and post-processing. That map is both your search space and your catalog of places where the 6x spread might be hiding.
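One lightweight way to keep that map is as data. The sketch below records the four decision points named above, each with its current setting and the alternatives you might search over. The `DecisionPoint` type and every listed setting are hypothetical placeholders to be replaced with what your own system actually does.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionPoint:
    name: str
    current: str
    alternatives: list[str] = field(default_factory=list)


# Placeholder entries; fill in from an audit of your own application code.
HARNESS_MAP = [
    DecisionPoint("retrieval method and ranking", "BM25 top-5",
                  ["dense top-5", "hybrid with rerank"]),
    DecisionPoint("context length and truncation", "hard cap per document",
                  ["summarize overflow"]),
    DecisionPoint("memory across turns", "keep last 3 turns",
                  ["keep all turns", "keep turns cited by the model"]),
    DecisionPoint("output formatting and post-processing", "raw text",
                  ["JSON with schema check"]),
]


def search_space_size(points: list[DecisionPoint]) -> int:
    """Count the configurations if each point varies independently."""
    size = 1
    for p in points:
        size *= 1 + len(p.alternatives)  # current setting plus alternatives
    return size


print(search_space_size(HARNESS_MAP))  # prints 36
```

Even four decision points with a couple of options each give dozens of combinations, and that count ignores structural changes such as adding a routing branch.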

Start with the text classification experiment

It is the most self-contained of the reference experiments, with a clean baseline comparison against a published context management system. If you want to understand how the search loop works before adapting it, this is the clearest path in.

Read the related work on agentic coding scaffolds

TerminalBench-2, the agentic coding benchmark used in this paper, is maintained separately and worth understanding as an evaluation surface for agent harnesses. The ForgeCode agent that Meta-Harness trails among Opus 4.6 agents is a useful reference point for what hand-engineered scaffolding currently achieves.
