The 6x gap lives in your code,
not your model.
Researchers at Stanford and MIT built a system that automatically searches for better harness code around language models. The harness is the code that decides what to retrieve, what to store, and what to present to the model at each step. Holding the model fixed and varying only the harness produced a 6x performance gap on the same benchmark. The binding constraint for many tasks turned out not to be model capability at all.
First surfaced in Tandemly Briefing — 2026-05-08.
The model gets all the attention.
The code around it doesn't.
When an LLM application underperforms, the default response is to swap the model. But the harness, the code that wraps every inference call, often has more influence over the output than the weights themselves.
Think about what happens when you deploy a language model to answer questions from a knowledge base. The model sees a prompt. That prompt was assembled by code. Some code decided which documents to retrieve, how many to include, in what order to present them, whether to truncate long ones, and how to format citations. That code is the harness. It runs before and after every model call, shaping what the model sees and what it returns.
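To make those decisions concrete, here is a minimal sketch of the prompt-assembly half of a harness. It is an illustration only: the `Document` type, the `build_prompt` function, and the `max_docs` and `max_chars` parameters are hypothetical names, and the retrieval scores are assumed to come from some upstream retriever not shown here.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str
    score: float  # assumed retrieval relevance, higher is better


def build_prompt(question, candidates, max_docs=3, max_chars=200):
    """Assemble the prompt the model will see (hypothetical harness step)."""
    # Decide which documents to include and in what order.
    ranked = sorted(candidates, key=lambda d: d.score, reverse=True)[:max_docs]
    blocks = []
    for i, doc in enumerate(ranked, start=1):
        # Truncate long documents so the context stays bounded.
        if len(doc.text) > max_chars:
            body = doc.text[:max_chars] + "..."
        else:
            body = doc.text
        # Attach a citation marker the model can refer back to.
        blocks.append(f"[{i}] ({doc.doc_id}) {body}")
    context = "\n".join(blocks)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer with citations like [1]."
    )
```

Every default in this function (how many documents, what order, where to cut) is a choice someone made by hand, and each one changes what the model is shown.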
Harnesses are still designed almost entirely by hand. Engineers draw on intuition, benchmarks, and iteration to decide how to structure context, which retrieval method to use, and how to manage memory across turns. The process is reasonable, but it has no systematic optimization step. You try some configurations, pick the one that seems best, and move on.
Text optimizers like TextGrad, ProTeGi, and OPRO have tried to close this gap by iteratively improving prompts. The problem is that they compress feedback too aggressively: they condition only on scores, recent solutions, or fixed summary formats. When a harness has nontrivial structure, with conditional routing, multi-stage filtering, and context construction that depends on intermediate results, that compressed feedback loses the signal needed to guide improvement. You can't fix what you can't see.
What if you treated harness design as a search problem, gave the optimizer access to the complete history of everything that happened during prior runs, and let it write and rewrite actual code rather than just tweaking prompts?
A search loop with
full diagnostic access.
Meta-Harness is an outer loop that searches over harness code for LLM applications. The key design choice is what the proposer gets to read before suggesting its next candidate.
Most optimization loops hand the optimizer a score: "this candidate got 62.3, the previous one got 58.1, try again." Meta-Harness hands the optimizer everything. The source code of every prior harness candidate. The evaluation score for each. The full execution traces: what the model was shown, what it retrieved, what it returned, where the errors occurred. All of it lives in a filesystem that the proposer agent navigates using shell commands.
The proposer reads this diagnostic history, identifies patterns in what worked and what failed, and proposes a new harness. Not a new prompt. A new piece of code. The system supports conditional routing, filtering pipelines, multi-stage context construction, memory systems, and any other logic the proposer can write. The proposed harness gets evaluated on a held-out task set, results are logged to the filesystem, and the loop repeats.
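The loop described above can be sketched in a few lines. This is a toy under stated assumptions, not the reference implementation: the directory layout (`candidate_000/harness.py`, `score.json`, `trace.jsonl`), the `run_search` function, and the `toy_propose` and `toy_evaluate` stand-ins are all hypothetical. In the real system the proposer is an agent that reads prior code and traces with shell commands; here it only counts prior candidates.

```python
import json
import tempfile
from pathlib import Path


def run_search(propose, evaluate, log_dir, iterations=3):
    """Outer loop: propose harness source, evaluate it, log everything to disk."""
    log_dir = Path(log_dir)
    best_score, best_source = float("-inf"), None
    for step in range(iterations):
        # The proposer is handed the whole log directory, not just a score.
        source = propose(log_dir)
        score, trace = evaluate(source)
        run_dir = log_dir / f"candidate_{step:03d}"
        run_dir.mkdir(parents=True)
        (run_dir / "harness.py").write_text(source)
        (run_dir / "score.json").write_text(json.dumps({"score": score}))
        (run_dir / "trace.jsonl").write_text(
            "\n".join(json.dumps(event) for event in trace)
        )
        if score > best_score:
            best_score, best_source = score, source
    return best_score, best_source


def toy_propose(log_dir):
    # Hypothetical stand-in: raise one parameter per prior candidate found.
    n_prior = len(list(Path(log_dir).glob("candidate_*")))
    return f"MAX_DOCS = {n_prior + 1}\n"


def toy_evaluate(source):
    # Hypothetical stand-in: executing a string is acceptable only in this toy.
    namespace = {}
    exec(source, namespace)
    max_docs = namespace["MAX_DOCS"]
    trace = [{"event": "model_call", "docs_shown": max_docs}]
    return 10.0 * max_docs, trace


with tempfile.TemporaryDirectory() as tmp:
    best_score, best_source = run_search(toy_propose, toy_evaluate, tmp)
    logged = sorted(p.name for p in Path(tmp).iterdir())
```

The design point the sketch tries to preserve is that each candidate leaves behind its source, its score, and its trace, so a later proposal can inspect why an earlier one scored as it did.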
This design uses roughly 400 times more context than the next best competing method. That sounds alarming, but the authors argue that compressed feedback is exactly what makes prior optimizers fail on structured code. The cost of more context is worth paying if it carries the signal needed to credit improvements to specific interventions.
The harnesses found by Meta-Harness are not simple rewrites. They contain nontrivial control flow: conditional routing that sends different query types through different retrieval paths, multi-stage filtering that removes low-confidence results before context assembly, and memory management logic that decides which prior turns are worth keeping. These are domain-specific policies, not generic templates.
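For a sense of what such control flow looks like, here is a small hand-written sketch of the three ingredients named above. It is not one of the discovered harnesses: the routing rule, the `confidence` and `pinned` fields, and the function names are all assumptions chosen for illustration.

```python
def route(query):
    """Pick a retrieval path from surface features of the query (toy rule)."""
    # Assumption: queries containing digits (IDs, error codes) suit exact match.
    if any(ch.isdigit() for ch in query):
        return "exact_match"
    return "semantic"


def filter_results(results, min_confidence=0.5):
    """Drop low-confidence results before context assembly."""
    return [r for r in results if r["confidence"] >= min_confidence]


def keep_turns(turns, budget=2):
    """Keep pinned turns plus the most recent unpinned ones, in original order."""
    unpinned_idx = [i for i, t in enumerate(turns) if not t.get("pinned")]
    recent_idx = set(unpinned_idx[-budget:]) if budget > 0 else set()
    return [t for i, t in enumerate(turns) if t.get("pinned") or i in recent_idx]
```

Each function encodes a policy, such as which path a query takes or which turns survive, and that kind of logic is something a prompt-only optimizer cannot express.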
Harness design creates
a 6x performance spread.
The headline result is not that Meta-Harness beats a single baseline. It's that the choice of harness around a fixed model matters enough to produce a 6x performance difference. That reframes where the optimization budget should go.
On an online text classification task where the model must manage a growing context of labeled examples, Meta-Harness improved over a state-of-the-art context management system by 7.7 points while using 4 times fewer context tokens. The discovered harness generalized to 9 out-of-distribution task variants with 73.1% average accuracy, suggesting it learned something structural about context management rather than overfitting to the training distribution.
On 200 IMO-level (International Mathematical Olympiad) problems paired with a retrieval corpus, a single harness discovered by Meta-Harness improved accuracy by 4.7 points on average across five held-out models: from 34.1% to 38.8%. It outperformed BM25 retrieval by 1.3 points overall. The generalization to held-out models is notable. A harness found by optimizing against one model transferred to models the system had never seen, which means the discovered retrieval strategy was capturing something about the task structure rather than exploiting a particular model's quirks.
On TerminalBench-2, a benchmark of software engineering tasks requiring multiple tool calls, discovered harnesses outperformed the hand-engineered Terminus-KIRA baseline by 1.7 points. Among agents using Haiku 4.5, the Meta-Harness candidate ranked first (37.6% pass rate). Among Opus 4.6 agents, it ranked second (76.4%), trailing only ForgeCode. This is a different regime from the retrieval tasks: agentic coding requires the harness to manage tool selection, error recovery, and multi-step planning rather than just context assembly.
Meta-Harness converged to a good harness 10 times faster than OpenEvolve and TTT-Discover, the next best competing methods. Despite using substantially more context per iteration, the full diagnostic access appears to reduce wasted exploration. The optimizer finds the right direction faster when it can see why prior candidates failed rather than just that they failed.
The three evaluation domains (text classification, retrieval-augmented reasoning, agentic coding) were selected by the authors and may not represent the full range of LLM application types. The 400x context increase has real cost implications in production settings. The paper demonstrates that discovered harnesses generalize across models within these domains but does not characterize the failure modes of generalization.
Where the leverage
actually lives.
The 6x harness-induced spread is the finding that should change behavior. Before spending engineering time or budget on model upgrades, it is worth asking whether the harness has been optimized at all.
Where to go
from here.
The reference implementation is public and the paper is self-contained. Two experiments are particularly approachable for getting started.