Agentic Search · Retrieval Systems

The harness matters
more than the method.

Four major AI agent frameworks. The same 116 benchmark questions. Two retrieval methods: keyword search and semantic vector search. The framework around the search tool moved accuracy more than the choice between them. Claude Code favored grep. Gemini CLI favored vectors. The same data produced different winners.

Core concept
Provider-tooling inductive bias: each agent harness was shaped by its developers to work well with their native model stack, and that shaping encodes a preference for one retrieval method over the other. You inherit those preferences when you adopt the harness.

First surfaced in Tandemly Briefing — 2026-05-17.

The field assumed
vector search was better.

Embedding pipelines and vector databases have become standard production infrastructure for AI retrieval. The assumption behind that investment was largely untested inside real agent loops.

When an AI agent needs to recall something from a large memory store, it typically chooses between two approaches. Grep scans text for exact or near-exact keyword matches. Vector retrieval converts text into numerical representations (embeddings) and returns the chunks that sit closest to the query in that space. The AI industry has broadly treated vector retrieval as the more capable option and has built its retrieval stacks around it.
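To make the contrast concrete, here is a minimal sketch of both approaches over a tiny in-memory store. Everything in it is hypothetical: the memory entries are invented, and `embed` is a toy bag-of-words stand-in for a learned embedding model, used only so the example runs without dependencies. A real embedding model would also match paraphrases that share no words with the query, which this toy version cannot do.

```python
import math
import re
from collections import Counter

# Invented memory entries, for illustration only.
MEMORY = [
    "User adopted a beagle named Pixel in March.",
    "User prefers window seats on long flights.",
    "User's dog had a vet appointment on Tuesday.",
]

def grep_search(pattern: str, docs: list[str]) -> list[str]:
    """Keyword path: return every entry the regex matches, case-insensitively."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [d for d in docs if rx.search(d)]

def embed(text: str) -> Counter:
    """Toy embedding (assumption): word counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def vector_search(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Vector path: rank entries by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

print(grep_search(r"beagle", MEMORY))
print(vector_search("dog vet appointment", MEMORY, k=1))
```

The structural difference is what matters here: grep returns an unranked set of exact matches, while the vector path always returns its top k entries, whether or not any of them is actually relevant.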

That assumption comes mostly from isolated retrieval benchmarks: a query goes in, a ranked list comes out, and precision and recall tell you how good it was. Those benchmarks test the search tool in isolation, not embedded inside an active agent harness with its own logic for context management, tool call formatting, and result delivery.

An agent isn't just retrieving. It retrieves, reasons, acts, and retrieves again, inside a framework with architectural opinions about how that loop should work. The question this study asks is whether the retrieval hierarchy that holds in isolation also holds when retrieval is one component inside a larger, opinionated system. The answer is more complicated than the field had assumed.

The question this paper asks

Across four production agent harnesses and two retrieval methods, which factor moves accuracy more: the choice between grep and vector retrieval, or the choice of harness itself?

Four harnesses,
116 questions, two methods.

The researchers took a long-context memory benchmark, ran it through four production agent frameworks with both retrieval methods, and also varied how the results were delivered back to the model.

The benchmark was a 116-question slice of LongMemEval, designed to test an agent's ability to recall relevant facts from a large, long-horizon memory store. Each question was run through both retrieval methods across four agent harnesses: Chronos, Claude Code, Codex, and Gemini CLI. The researchers also varied the delivery format: inline tool results (search results land directly in the conversation context window) versus file-based delivery (results are written to a file the model reads separately).
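The shape of that design can be sketched as a grid of conditions. This assumes a full cross of harness, retrieval method, and delivery format, which the coverage available here does not confirm; it also ignores the model variants tested within a harness. The names below are labels for illustration, not identifiers from the paper.

```python
from itertools import product

HARNESSES = ["Chronos", "Claude Code", "Codex", "Gemini CLI"]
METHODS = ["grep", "vector"]
DELIVERY = ["inline", "file"]
NUM_QUESTIONS = 116

# Assumption: every harness is paired with every method and delivery format.
conditions = list(product(HARNESSES, METHODS, DELIVERY))

print(len(conditions))                  # 16 conditions under the full-grid assumption
print(len(conditions) * NUM_QUESTIONS)  # 1856 question runs per model under the same assumption
```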

Each harness is a real production framework shaped by its developers to work well with a specific model family. Claude Code was tested with Claude Opus and Haiku, Codex with GPT-series models, and Gemini CLI with Gemini 3.1 Pro. Each has distinct conventions for how it stores memory, how it issues tool calls, and how it formats prompts. The experiment therefore measured the full pipeline, from retrieval through harness integration and model reasoning to final output, rather than retrieval methods in isolation.

Questions tested: 116 (LongMemEval benchmark slice)
Agent harnesses: 4 (Chronos, Claude Code, Codex, Gemini CLI)
Retrieval methods: 2 (grep keyword search vs. vector semantic search)

What LongMemEval measures

LongMemEval tests whether an agent can correctly recall specific facts from a long, dense memory store that contains many plausible but incorrect distractors. It is harder than standard QA benchmarks because the agent must pick the right fact out from among many near-misses rather than answer from a short, clean passage. How large that context burden grows turns out to matter for which retrieval method wins.

Same data, different harnesses,
different winners.

The variance from switching harnesses was larger than the variance from switching retrieval methods. The architecture surrounding the search tool, not the search tool itself, was the primary driver of accuracy differences.
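One way to read "harness variance exceeds method variance" is as a comparison of marginal spreads in an accuracy table. The sketch below assumes that reading; the paper may define the comparison differently. The accuracy values are placeholders invented for illustration and are not results from the study, and `harness_a` and `harness_b` are deliberately not real harness names.

```python
from statistics import mean

# PLACEHOLDER accuracies, invented for illustration. NOT results from the paper.
accuracy = {
    ("harness_a", "grep"): 0.80, ("harness_a", "vector"): 0.70,
    ("harness_b", "grep"): 0.50, ("harness_b", "vector"): 0.60,
}

def marginal_spread(table: dict, axis: int) -> float:
    """Range (max - min) of mean accuracy across the levels of one factor.

    axis=0 groups cells by harness, axis=1 groups cells by retrieval method.
    """
    groups: dict[str, list[float]] = {}
    for key, acc in table.items():
        groups.setdefault(key[axis], []).append(acc)
    means = [mean(values) for values in groups.values()]
    return max(means) - min(means)

harness_spread = marginal_spread(accuracy, axis=0)
method_spread = marginal_spread(accuracy, axis=1)
print(f"spread across harnesses: {harness_spread:.2f}")
print(f"spread across methods:   {method_spread:.2f}")
```

In this made-up table the two harnesses prefer opposite methods, so the method effect cancels out when averaged across harnesses. That is why a single pooled "grep vs vector" number can hide a real per-harness preference.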

Conventional assumption
Vector beats grep. Semantic similarity is more flexible than keyword matching. Invest in embedding quality and chunking strategy. The retrieval method determines search quality.
What this study found
The harness beats both. Switching harnesses moved accuracy more than switching retrieval methods. And different harnesses favored opposite methods. The architectural frame around retrieval is the dominant variable.
Claude Code
Grep advantage (persistent)
With Claude Opus and Haiku, grep held a consistent accuracy advantage over vector retrieval. This advantage was not marginal and held across query types. The harness and model combination favors exact keyword matching.
Gemini CLI
Vector advantage (persistent)
With Gemini 3.1 Pro, the pattern reversed. Vector retrieval held a consistent accuracy advantage over grep. The same benchmark, the same questions, the opposite winner.
Chronos & Codex
Results vary by context load
The pattern for the remaining two harnesses interacted more with context density and delivery format. The key signal: even here, harness choice mattered more than the retrieval method alone.
Finding: Context density moderates the result

At earlier points in a memory timeline, when the context bundle is still small, vector retrieval tended to hold its own. As context grew larger and noisier, grep's advantage became more pronounced. Exact keyword matching appears better suited to the needle-in-a-haystack task that emerges when the agent must isolate a specific fact inside a heavily loaded context window.
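To check whether the same pattern shows up in your own logs, you can bucket evaluation records by context size and compare methods within each bucket. This is a minimal sketch: the record layout, the bucket edges, and the sample records are all assumptions made for illustration, and the sample records are invented rather than taken from the study.

```python
from collections import defaultdict

def accuracy_by_bucket(records, edges):
    """Per-(bucket, method) accuracy.

    records: iterable of (context_tokens, method, correct) tuples.
    edges: ascending token thresholds; bucket i holds records whose
           context size exceeds exactly i of the thresholds.
    """
    tally = defaultdict(lambda: [0, 0])  # (bucket, method) -> [correct, total]
    for tokens, method, correct in records:
        bucket = sum(tokens > edge for edge in edges)
        cell = tally[(bucket, method)]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: right / total for key, (right, total) in tally.items()}

# Invented records, for illustration only.
sample = [
    (5_000, "grep", True), (5_000, "vector", True),
    (80_000, "grep", True), (80_000, "vector", False),
]
for key, acc in sorted(accuracy_by_bucket(sample, edges=[10_000, 50_000]).items()):
    print(key, acc)
```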

Finding: Delivery format also shifted outcomes

Whether tool results were delivered inline (appearing directly in the conversation context) or file-based (written to a file the model reads separately) shifted accuracy in ways that interacted with each harness's behavior. This means the retrieval pipeline is not the only variable: how results reach the model's reasoning step is also a meaningful choice.
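The two delivery modes can be sketched as two ways of turning the same search hits into a tool message. This is an illustration of the idea only: the function names, the file name, and the pointer message are hypothetical, and the source does not describe how any of the four harnesses actually implements file-based delivery.

```python
import tempfile
from pathlib import Path

def deliver_inline(results: list[str]) -> str:
    """Inline: the full result text becomes the tool message in context."""
    return "\n".join(results)

def deliver_as_file(results: list[str], directory: str) -> str:
    """File-based: write results to disk and return only a short pointer.

    The model then has to read the file in a separate step.
    """
    path = Path(directory) / "search_results.txt"  # hypothetical file name
    path.write_text("\n".join(results), encoding="utf-8")
    return f"{len(results)} results written to {path}"

# Invented search hits, for illustration only.
hits = ["adopted a beagle named Pixel", "first vet visit for Pixel"]
print(deliver_inline(hits))
with tempfile.TemporaryDirectory() as tmp:
    print(deliver_as_file(hits, tmp))
```

The difference the study points at is visible in the return values: inline delivery spends context tokens on every hit immediately, while file-based delivery spends a single line and defers the rest to a later read.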

Scope and limitations

This synthesis is based on briefing coverage and the publicly available abstract. The full paper likely contains specific accuracy numbers per harness and delivery configuration that are not captured here. The study tests four specific harnesses on one benchmark slice; results on other tasks or with other frameworks may differ. The core finding about harness variance exceeding retrieval-method variance appears consistent across conditions, but the magnitude and direction for less-studied harnesses should be verified against the full paper before generalizing.

Audit the harness
before tuning the embeddings.

The harness is a first-class accuracy variable, not a developer-ergonomics concern. The choice of agent framework shapes whether grep or vector retrieval is the better default for your specific setup. Benchmarking retrieval outside your actual harness tells you less than you think.

1. For production teams running agentic RAG
Before investing further in embedding quality, chunking strategy, or vector infrastructure, run a controlled test of grep versus vector retrieval inside your actual harness. The retrieval method that wins in an isolated benchmark may not be the one that wins inside your specific framework. Let your harness, your model, and your actual queries determine the answer.
2. For Claude Code users
The evidence here suggests grep may be the better default retrieval mechanism when working with Claude Opus and Haiku inside Claude Code. If you've been defaulting to vector search because it's the conventional recommendation, this is worth testing. The grep advantage was persistent, not situational.
3. For teams experiencing accuracy drops at high memory load
The context-density finding is practical: as the memory store grows and the context window becomes noisier, grep's advantage over vector retrieval appears to grow. If your system degrades as memory accumulates, the retrieval method and delivery format are worth examining alongside context management approaches.
4. For anyone diagnosing stuck agentic performance
The practical diagnostic from the briefing: take a sample of production queries and run them through a grep-only path with file-based tool result delivery. If accuracy moves meaningfully, the bottleneck was not embedding quality but how the harness integrates and delivers retrieval. This is a low-cost experiment that takes less time than tuning embedding parameters.
5. For framework architects and platform teams
When you choose an agent harness for your organization, you are inheriting its developers' architectural assumptions about how retrieval should work. Understanding which direction a harness leans, and why, is relevant to the systems built on top of it. The inductive biases revealed here are an argument for systematic retrieval benchmarking as part of harness selection, not just capability demos.
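The controlled test in item 1 and the diagnostic in item 4 both reduce to running the same queries through two paths and comparing the outcomes. A minimal sketch follows, with loud assumptions: `compare_paths` and its arguments are hypothetical names, each path is any callable that takes a question and returns your harness's final answer, and grading is a naive case-insensitive exact match, which is almost certainly cruder than the benchmark's own grading.

```python
from typing import Callable

Query = tuple[str, str]  # (question, expected answer)

def run_path(answer_fn: Callable[[str], str], queries: list[Query]) -> list[bool]:
    """Grade one path with case-insensitive exact match (an assumption)."""
    return [answer_fn(q).strip().lower() == gold.strip().lower() for q, gold in queries]

def compare_paths(path_a: Callable[[str], str],
                  path_b: Callable[[str], str],
                  queries: list[Query]) -> dict:
    """Run the same queries through both paths and summarize the difference."""
    if not queries:
        raise ValueError("need at least one query")
    a = run_path(path_a, queries)
    b = run_path(path_b, queries)
    return {
        "accuracy_a": sum(a) / len(a),
        "accuracy_b": sum(b) / len(b),
        # Questions where the two paths disagree are the ones worth reading by hand.
        "only_a_correct": sum(x and not y for x, y in zip(a, b)),
        "only_b_correct": sum(y and not x for x, y in zip(a, b)),
    }

# Stub paths standing in for real harness calls, with invented queries.
queries = [("q1", "paris"), ("q2", "blue")]
vector_path = lambda q: {"q1": "Paris", "q2": "red"}[q]
grep_file_path = lambda q: {"q1": "paris", "q2": "Blue"}[q]
print(compare_paths(vector_path, grep_file_path, queries))
```

Reporting the disagreement counts alongside the two accuracies is a deliberate choice: with a sample of only tens of queries, two paths can post similar accuracy while succeeding on different questions.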

Where to go
from here.

If you want to go deeper or run the comparison yourself.

1. Read the paper
Sen, Kasturi, Lumer, Gulati, Subbiah et al. (2026). Is Grep All You Need? How Agent Harnesses Reshape Agentic Search. arXiv:2605.15184.
2. Run the two-path experiment on your current stack
Select 50 to 100 representative production queries. Route each through both grep and vector retrieval inside your actual harness. Compare accuracy on each path. The result tells you more than any isolated retrieval benchmark can.
3. Test both delivery modes
If your harness supports file-based tool result delivery, compare it against inline delivery on the same benchmark subset. It is often a single configuration change, and it can shift outcomes in ways that interact with harness behavior.
4. Explore LongMemEval for your own evaluations
LongMemEval is a public benchmark designed specifically to test long-horizon memory recall under distractor load. If your system handles growing memory stores or multi-session context, it is a more relevant eval target than standard short-context QA benchmarks.
5. Read the Meta-Harness synthesis for a complementary angle
The Meta-Harness paper (Lee et al., Stanford and MIT) showed that harness code variation around a fixed model produces a 6x accuracy spread. This paper shows that harness choice determines which retrieval method wins. Together, they make the same argument: the code surrounding the model is a first-class optimization target. See the Meta-Harness synthesis for the paired argument.