The harness matters
more than the method.
Four major AI agent frameworks. The same 116 benchmark questions. Two retrieval methods: keyword search and semantic vector search. The framework around the search tool moved accuracy more than the choice between them. Claude Code favored grep. Gemini CLI favored vectors. The same data produced different winners.
First surfaced in Tandemly Briefing — 2026-05-17.
The field assumed
vector search was better.
Embedding pipelines and vector databases have become standard production infrastructure for AI retrieval. The assumption behind that investment was largely untested inside real agent loops.
When an AI agent needs to recall something from a large memory store, it typically chooses between two approaches. Grep scans text for exact or near-exact keyword matches. Vector retrieval converts text into numerical representations (embeddings) and returns the chunks whose embeddings lie closest to the query's. The AI industry has broadly treated vector retrieval as the more capable option, which is why embedding pipelines, chunking strategies, and vector databases now sit inside most systems that need to retrieve information.
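The two approaches can be sketched in a few lines. This is a minimal illustration, not the study's implementation: the function names and sample chunks are hypothetical, and `embed` is a toy bag-of-words stand-in for a learned embedding model, so it shows the rank-by-proximity mechanics without capturing real semantic similarity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def grep_search(query, chunks):
    """Keyword retrieval: keep chunks that contain every query term."""
    terms = set(tokenize(query))
    return [c for c in chunks if terms <= set(tokenize(c))]


def embed(text):
    """Toy embedding (assumption): a bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vector_search(query, chunks, k=2):
    """Vector retrieval: rank chunks by similarity to the query, keep top k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical memory chunks, invented for illustration.
chunks = [
    "The user adopted a dog named Biscuit in March.",
    "The user prefers tea over coffee in the morning.",
    "Biscuit had a vet appointment last week.",
]

print(grep_search("dog Biscuit", chunks))         # only the first chunk has both terms
print(vector_search("Biscuit vet", chunks, k=1))  # the third chunk ranks closest
```

Grep returns an unranked set that either matches or does not, while vector search always returns its top k, however weak the similarity. That difference in failure mode is one reason the two can behave differently once a harness wraps them.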
That assumption comes mostly from standalone retrieval benchmarks: a query goes in, a ranked list comes out, and precision and recall score the result. Those benchmarks test the search tool on its own, not embedded inside an active agent harness with its own logic for context management, tool call formatting, and result delivery.
An agent isn't just retrieving. It retrieves, reasons, acts, and retrieves again, inside a framework with architectural opinions about how that loop should work. The question this study asks is whether the retrieval hierarchy that holds in isolation also holds when retrieval is one component inside a larger, opinionated system. The answer is more complicated than the field had assumed.
Across four production agent harnesses and two retrieval methods, which factor moves accuracy more: the choice between grep and vector retrieval, or the choice of harness itself?
Four harnesses,
116 questions, two methods.
The researchers took a long-context memory benchmark, ran it through four production agent frameworks with both retrieval methods, and also varied how the results were delivered back to the model.
The benchmark was a 116-question slice of LongMemEval, designed to test an agent's ability to recall relevant facts from a large, long-horizon memory store. Each question was run through both retrieval methods across four agent harnesses: Chronos, Claude Code, Codex, and Gemini CLI. The researchers also varied the delivery format: inline tool results (search results land directly in the conversation context window) versus file-based delivery (results are written to a file the model reads separately).
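The shape of the experiment can be written down as a grid of conditions. The sketch below assumes a full cross of harness, retrieval method, and delivery format. The source does not confirm that every combination was run, and it ignores the fact that some harnesses were tested with more than one model, so the counts are an upper bound under that assumption rather than a reported figure.

```python
from itertools import product

# Names taken from the study description above.
HARNESSES = ["Chronos", "Claude Code", "Codex", "Gemini CLI"]
METHODS = ["grep", "vector"]
DELIVERY = ["inline", "file"]
NUM_QUESTIONS = 116


def build_conditions():
    """Enumerate every harness x method x delivery combination (assumed full cross)."""
    return [
        {"harness": h, "method": m, "delivery": d}
        for h, m, d in product(HARNESSES, METHODS, DELIVERY)
    ]


conditions = build_conditions()
print(len(conditions))                   # number of distinct configurations
print(len(conditions) * NUM_QUESTIONS)   # question runs if each config sees all questions
```

Laying the grid out this way makes the comparison explicit: holding the harness fixed and flipping the method isolates the retrieval effect, while holding the method fixed and flipping the harness isolates the harness effect.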
Each harness is a real production framework shaped by its developers to work well with a specific model family. Claude Code was tested with Claude Opus and Haiku, Codex with GPT-series models, and Gemini CLI with Gemini 3.1 Pro. Each has distinct conventions for how it stores memory, how it issues tool calls, and how it formats prompts. The experiment therefore measured not retrieval methods in isolation but the full pipeline: retrieval, harness integration, model reasoning, and final output.
LongMemEval tests whether an agent can correctly recall specific facts from a long, dense memory store that contains many plausible but incorrect distractors. It is harder than standard QA benchmarks because the agent must pick out the right piece of information from a context window crowded with noise. How large that context grows turns out to matter for which retrieval method wins.
Same data, different harnesses,
different winners.
The variance from switching harnesses was larger than the variance from switching retrieval methods. The architecture surrounding the search tool, not the search tool itself, was the primary driver of accuracy differences.
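One simple way to run this comparison on your own results is to average accuracy along each axis and compare the spreads. The sketch below is an assumption about how such a check could look, not the paper's analysis: it uses the range of marginal means as a rough proxy for variance, `marginal_spread` is a hypothetical helper, and the accuracy values are made-up placeholders attached to generic harness names, not numbers from the study.

```python
from statistics import mean


def marginal_spread(acc, axis):
    """Range (max - min) of mean accuracy along one axis of the key.

    acc maps (harness, method) -> accuracy; axis 0 groups by harness,
    axis 1 groups by retrieval method.
    """
    groups = {}
    for key, value in acc.items():
        groups.setdefault(key[axis], []).append(value)
    means = [mean(values) for values in groups.values()]
    return max(means) - min(means)


# Placeholder values for illustration only; NOT results from the study.
acc = {
    ("harness_a", "grep"): 0.70,
    ("harness_a", "vector"): 0.60,
    ("harness_b", "grep"): 0.40,
    ("harness_b", "vector"): 0.50,
}

harness_spread = marginal_spread(acc, 0)  # effect of switching harness
method_spread = marginal_spread(acc, 1)   # effect of switching retrieval method
print(harness_spread > method_spread)
```

In the placeholder table the preferred method also flips between the two harnesses, which is the pattern the study describes: averaged over harnesses the methods look similar, yet each harness has its own winner.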
At earlier points in a memory timeline, when the context bundle is still small, vector retrieval tended to hold its own. As context grew larger and noisier, grep's advantage became more pronounced. Exact keyword matching appears better suited to the needle-in-a-haystack task that emerges when the agent must isolate a specific fact inside a heavily loaded context window.
Whether tool results were delivered inline or through a file shifted accuracy in ways that interacted with each harness's behavior. The retrieval pipeline is therefore not the only variable: how results reach the model's reasoning step is also a meaningful choice.
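The difference between the two delivery formats is easiest to see side by side. This is a minimal sketch under stated assumptions: the message dictionaries are a hypothetical shape, not the tool-result format of any of the four harnesses, and the sample results are invented.

```python
import os
import tempfile


def deliver_inline(results):
    """Inline delivery: the full results land in the tool message itself."""
    return {"role": "tool", "content": "\n".join(results)}


def deliver_file(results, directory=None):
    """File-based delivery: write results to disk and return only a pointer."""
    fd, path = tempfile.mkstemp(suffix=".txt", dir=directory, text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write("\n".join(results))
    return {"role": "tool", "content": f"Results written to {path}", "path": path}


# Hypothetical search results, for illustration only.
results = [
    "session 3: user adopted a dog named Biscuit",
    "session 9: Biscuit had a vet appointment",
]

inline_msg = deliver_inline(results)
file_msg = deliver_file(results)

# With file delivery the model must take a second step to see the content.
with open(file_msg["path"], encoding="utf-8") as fh:
    file_text = fh.read()

print(inline_msg["content"] == file_text)
os.remove(file_msg["path"])
```

The retrieved text is identical in both cases. What changes is whether it occupies the conversation context immediately or only after the model chooses to read the file, which is where harness-specific behavior can enter.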
This synthesis is based on briefing coverage and the publicly available abstract. The full paper likely contains specific accuracy numbers per harness and delivery configuration that are not captured here. The study tests four specific harnesses on one benchmark slice; results on other tasks or with other frameworks may differ. The core finding about harness variance exceeding retrieval-method variance appears consistent across conditions, but the magnitude and direction for less-studied harnesses should be verified against the full paper before generalizing.
Audit the harness
before tuning the embeddings.
The harness is a first-class accuracy variable, not a developer-ergonomics concern. The choice of agent framework shapes whether grep or vector retrieval is the better default for your specific setup. Benchmarking retrieval outside your actual harness tells you less than you think.
Where to go
from here.
If you want to go deeper or run the comparison yourself.