Mechanistic Interpretability · Agent Loops

When interpretability
runs itself.

First surfaced in Tandemly Briefing — 2026-05-21.

The work of reverse-engineering what language model features actually detect has always been a manual, slow, human-bottlenecked process. Two researchers at Harvard changed the setup: one agent loop navigates activation space to find what is worth examining; a second probes those features with targeted tests and refines its hypotheses until they hold up. The result treats interpretability as a workflow rather than a dashboard.

Core concept
Agent-driven interpretability: a discovery loop navigates activation space via a k-NN graph to surface interpretable features, coupled with an explanation loop that designs contrastive tests, observes results, and refines hypotheses rather than committing to a single one-shot description.

Interpretability doesn't
scale by hand.

Mechanistic interpretability, the project of figuring out what individual neurons and features inside a neural network have learned to detect, has a throughput problem. Modern language models contain millions of features. Manual analysis handles dozens per week.

To understand what a particular feature inside a language model does, a researcher typically follows a slow cycle. Gather examples of text that cause the feature to activate. Look for a pattern. Form a hypothesis about what it detects. Design a new experiment that would falsify the hypothesis if it were wrong. Run the experiment. Observe whether the feature behaves as predicted. Revise. Repeat. This is careful, rigorous work. It is also extremely slow, and it does not scale: the number of features in modern models far outstrips what the interpretability research community can examine by hand.

Automatic interpretability tools exist to accelerate the process. The dominant approach today is one-shot: collect a batch of text examples that activate the feature, feed them to a language model, and ask it to describe what they have in common. This is faster than manual analysis. But it has a structural flaw that limits its reliability. It forms one hypothesis and never tests it. If the initial description is wrong, off-target, or incomplete, there is no mechanism to catch and correct the error. The system produces an explanation and stops, with no record of how that explanation was reached and no way to know whether it would survive a challenging follow-up test.

What the field was missing was something closer to the actual scientific loop: propose a hypothesis, design a test that would challenge it, run the test, see what the evidence says, and revise accordingly. One-shot auto-interp skips the test.

The question this paper asks

What happens when you replace the one-shot explanation with a full agent loop that proposes hypotheses, designs and runs targeted tests, observes the results, and iterates? And separately: can you automate the upstream problem of deciding which features are worth examining in the first place?

Two loops,
one pipeline.

Marin-Llobet and Ferrando built two coupled agent loops. The first decides what to examine. The second figures out what it means. Neither requires a human to intervene between steps.

The two loops are designed to work together, each handling a distinct part of the problem that manual interpretability researchers face. The discovery loop addresses selection: out of millions of features, which ones are interpretable enough to be worth studying? The explanation loop addresses understanding: given a feature flagged as worth examining, what does it actually detect?

Discovery Agent
Loop 1
Navigates activation space
Builds a k-NN graph where each node is a feature and edges connect features that tend to co-activate on similar inputs. Applies statistical separability metrics to score which features have crisp, verifiable functions versus diffuse, hard-to-describe ones. Traverses the graph and surfaces high-separability clusters as candidates for the explanation loop.
Explanation Agent
Loop 2
Probes and refines hypotheses
Takes a feature flagged by the discovery agent. Generates contrastive prompt pairs where it predicts the feature will activate on one and not the other. Runs the tests against the actual model. Observes whether predictions held. Revises the working hypothesis where they did not. Repeats until the explanation is stable across multiple contrastive tests.

The discovery agent's k-NN graph is the key structural move on the selection problem. In a model with millions of features, the naive approach is to examine features by activation frequency or at random. Neither strategy systematically finds features that are likely to be interpretable. A graph of co-activation relationships, combined with a statistical measure of separability, tells the agent something different: which features have activation patterns clean enough that a targeted test can actually distinguish what activates them from what does not.
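
The graph-plus-separability idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, cosine similarity between per-feature activation profiles stands in for "co-activation", and the separability score is a simple d-prime-style gap between a feature's high and low activations, split at the midpoint of its range. The paper's actual metrics may differ.

```python
import numpy as np


def knn_coactivation_graph(acts, k):
    """Map each feature to its k most similar features.

    acts: array of shape (n_inputs, n_features).
    Similarity is cosine similarity between per-feature activation profiles
    (an assumed stand-in for "co-activate on similar inputs").
    """
    profiles = np.asarray(acts, dtype=float).T            # (n_features, n_inputs)
    norms = np.linalg.norm(profiles, axis=1, keepdims=True)
    unit = profiles / np.maximum(norms, 1e-12)            # avoid division by zero
    sim = unit @ unit.T                                   # dense; fine for small demos only
    np.fill_diagonal(sim, -np.inf)                        # a feature is not its own neighbor
    nearest = np.argsort(-sim, axis=1)[:, :k]
    return {f: [int(j) for j in row] for f, row in enumerate(nearest)}


def separability(feature_acts):
    """Assumed proxy: standardized gap between 'on' and 'off' activations."""
    x = np.asarray(feature_acts, dtype=float)
    cut = 0.5 * (x.min() + x.max())                       # midpoint split (an assumption)
    on, off = x[x > cut], x[x <= cut]
    if on.size == 0 or off.size == 0:
        return 0.0                                        # no contrast to measure
    pooled = np.sqrt(0.5 * (on.var() + off.var()))
    return float((on.mean() - off.mean()) / (pooled + 1e-8))


# Toy activation matrix: 6 inputs x 4 features (illustrative values only).
acts = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [1.0, 1.1, 0.0, 0.1],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
    [0.0, 0.1, 1.0, 1.0],
    [0.0, 0.0, 0.9, 1.1],
])
graph = knn_coactivation_graph(acts, k=1)
scores = {f: separability(acts[:, f]) for f in graph}
```

At the scale the article describes (millions of features), the dense similarity matrix would not fit in memory, so an approximate nearest-neighbor index would be needed in place of the full matrix product.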

The explanation agent's loop is the key structural move on the explanation problem. The adversarial design is what gives it leverage over one-shot methods. Instead of describing a batch of activating examples and stopping, the agent makes a prediction, tests it, and responds to the evidence. A one-shot approach can confidently describe a feature as detecting "references to royalty" and never learn that it also activates on formal titles in business contexts. The explanation agent's loop catches that: a contrastive test built on formal business language would contradict the prediction, triggering a hypothesis revision.
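
The predict, test, revise cycle can be illustrated with a toy version of the royalty example. Everything here is a hypothetical stand-in: the hypothesis is a set of trigger words rather than a natural-language description, `fake_feature` replaces running the real model, the contrastive pairs are fixed rather than designed by an agent, and `revise` is a crude rule where the described system would use a language model. It only handles mismatches on the positive prompt of each pair.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    positive: str
    negative: str
    predicted: tuple   # (fires on positive?, fires on negative?) per the hypothesis
    observed: tuple    # same, as measured on the (stand-in) model

    @property
    def held(self):
        return self.predicted == self.observed


def predict(hypothesis, prompt):
    """Toy prediction: the feature fires if any hypothesized trigger word appears."""
    return any(word in hypothesis for word in prompt.lower().split())


def revise(hypothesis, result):
    """Toy revision: words that differ between the two prompts are the suspects."""
    contrast = set(result.positive.lower().split()) - set(result.negative.lower().split())
    if result.observed[0] and not result.predicted[0]:
        return hypothesis | contrast   # missed activation: widen the hypothesis
    if result.predicted[0] and not result.observed[0]:
        return hypothesis - contrast   # false alarm: narrow the hypothesis
    return hypothesis


def explanation_loop(hypothesis, pairs, observe, max_rounds=5):
    """Test the hypothesis on every pair; revise and retest until all predictions hold."""
    history = []
    for _ in range(max_rounds):
        results = [
            PairResult(pos, neg,
                       (predict(hypothesis, pos), predict(hypothesis, neg)),
                       (observe(pos), observe(neg)))
            for pos, neg in pairs
        ]
        history.append(results)
        failures = [r for r in results if not r.held]
        if not failures:
            break                      # stable across all contrastive tests
        for r in failures:
            hypothesis = revise(hypothesis, r)
    return hypothesis, history


def fake_feature(prompt):
    """Stand-in for the real feature: fires on royalty AND on a business title."""
    return any(w in {"king", "queen", "chairman"} for w in prompt.lower().split())


pairs = [("the king spoke", "the dog spoke"),
         ("the chairman spoke", "the dog spoke")]
final, history = explanation_loop({"king", "queen"}, pairs, fake_feature)
```

In this toy run the initial "royalty" hypothesis fails the business-title pair in the first round, is widened, and passes every pair in the second round, which is the kind of correction a one-shot description has no opportunity to make.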

Both loops produce auditable traces. Every hypothesis considered, every test designed, every test result, and every revision is logged as a structured record. The pipeline is autonomous, but its reasoning is not opaque.

What "auditable trace" means here

A one-shot explanation tells you what the system concluded. An auditable trace tells you how: which hypotheses were live at each step, which test was designed to challenge them, what the model actually did when given the contrastive inputs, and how the hypothesis changed in response. The trace is what makes an automated explanation revisable and contestable in a way that a single-shot description is not.
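
One way to picture such a trace is as a list of structured step records. The sketch below is an assumption about what a record could contain, based only on the four elements named above (live hypothesis, test, observed behavior, revision); the class names, field names, and feature identifier are hypothetical and not taken from the paper.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class TraceStep:
    step: int
    hypothesis: str                            # hypothesis live at this step
    test_prompts: List[str]                    # contrastive inputs designed to challenge it
    predicted: List[bool]                      # what the hypothesis said would happen
    observed: List[bool]                       # what the model actually did
    revised_hypothesis: Optional[str] = None   # set only when the evidence forced a change


@dataclass
class FeatureTrace:
    feature_id: str
    steps: List[TraceStep] = field(default_factory=list)

    def log(self, hypothesis, test_prompts, predicted, observed, revised_hypothesis=None):
        """Append one step; the step index is assigned automatically."""
        step = TraceStep(len(self.steps), hypothesis, list(test_prompts),
                         list(predicted), list(observed), revised_hypothesis)
        self.steps.append(step)
        return step

    def failed_steps(self):
        """Steps where at least one prediction did not match the observation."""
        return [s for s in self.steps if s.predicted != s.observed]

    def to_jsonl(self):
        """One JSON object per step, suitable for an append-only log file."""
        return "\n".join(json.dumps({"feature_id": self.feature_id, **asdict(s)})
                         for s in self.steps)


# Illustrative usage with a made-up feature identifier.
trace = FeatureTrace("example_feature_0")
trace.log("references to royalty",
          ["the chairman spoke", "the dog spoke"],
          predicted=[False, False], observed=[True, False],
          revised_hypothesis="royal or formal business titles")
trace.log("royal or formal business titles",
          ["the chairman spoke", "the dog spoke"],
          predicted=[True, False], observed=[True, False])
lines = trace.to_jsonl().splitlines()
```

Keeping predictions and observations side by side in every record is what lets a reviewer later find the step where a hypothesis was dropped and judge whether the test that ruled it out was too narrow.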

Better discovery,
more precise explanations.

The system was tested on the Gemma-2 family of language models and on weight-sparse transformer MLP neurons. In both settings, the agent-loop approach outperformed one-shot auto-interp baselines on interpretability quality metrics.

One-shot auto-interp
One description, no testing. Feed activating examples to a language model. Ask it to describe the pattern. Accept the answer. No mechanism to catch wrong or incomplete explanations. No record of how the explanation was formed.
Agent-loop approach
Iterative probing with an audit trail. Discover candidates via co-activation graph. Probe each with contrastive tests. Revise when predictions fail. Produce a complete trace showing every hypothesis, test, and revision. Explanations that survive adversarial testing are more precise.

Finding 1: Both loops contribute independently

The performance improvement came from two sources, not one. The discovery agent's graph-based selection with statistical separability surfaced more interpretable features than frequency-based or random selection: starting from better candidates meant the explanation loop had more tractable features to work with. The explanation loop then produced more precise explanations on those candidates than a single-shot description would have. Removing either loop degraded results, establishing that the two components are not redundant.

Finding 2: Iterative probing catches what one-shot misses

The explanation agent's loop caught cases where the initial hypothesis was partially correct but missed a secondary activation pattern. In several documented examples, the first-round hypothesis described the most prominent activation context accurately. But the contrastive tests revealed that the feature also activated in a related but distinct context the initial description had not named. These refinements would not appear in one-shot output because there is no second round in one-shot. The iterative loop did not always change the explanation, but when it did, the revision was meaningful.

Finding 3: The trace enables post-hoc scrutiny

Researchers reviewing the auditable traces could identify cases where the final explanation was technically accurate but had missed a nuance that only appeared later in the probing record. The trace made this visible: the missing hypothesis was there in the log, briefly considered and then ruled out by a test that, in hindsight, was designed too narrowly. This is the kind of failure that is invisible in one-shot output and visible in an audit trail. It is also the kind of failure a researcher can correct without re-running the full loop from scratch.

Scope and access note

This synthesis is based on the paper's documented findings and editorial notes. The full paper was not retrievable at time of synthesis due to access limitations; specific precision, recall, or agreement metrics are available in the paper directly at arxiv.org/abs/2605.01555. The qualitative findings and methodological structure described here are drawn from the documented record of the paper's approach and conclusions.

What this means
for building with AI.

The framing shift here is from interpretability as a read-only dashboard to interpretability as a testable, iterative workflow. That distinction matters for anyone who needs to make defensible claims about what a model has learned.

1
For ML interpretability researchers
The two-loop structure gives you a concrete architecture for interpretability automation: a discovery agent that navigates co-activation graphs to find tractable features, and an explanation agent that probes those features through contrastive testing rather than one-shot description. The k-NN graph with statistical separability is a specific, implementable starting point for the discovery problem. The iterative probing loop is a specific, implementable starting point for the explanation problem.
2
For teams auditing model behavior
The auditable trace is the most practically valuable artifact this approach produces. An explanation of what a feature detects is useful. An explanation plus a complete log of the tests used to reach it is auditable and revisable. If your team needs to make external claims about model behavior, claims that could be challenged or reviewed by a third party, the trace is what makes those claims defensible. One-shot explanations cannot be audited. Traced explanations can.
3
For practitioners building on Gemma-2
The approach was validated on Gemma-2 models across the 2B to 27B parameter range, which are publicly available and widely used. If you are building applications on this family and need to understand or audit specific internal behaviors, the methods in this paper were evaluated on the architecture you already use, so they should carry over without cross-architecture adaptation.
4
For anyone currently using one-shot auto-interp
The core finding is that one-shot explanations are incomplete in a specific, systematic way: they describe the most prominent activation context but miss secondary patterns that only surface through targeted follow-up tests. If you are running one-shot auto-interp at scale and relying on those explanations to characterize model behavior, you are working with descriptions that are likely only partially correct, and whose errors the pipeline has no way to catch. The iterative loop addresses this by building correction into the pipeline.
5
A note on production readiness
This is a research paper demonstrating the approach works better than baselines on specific models. It is not a production-ready tool. Teams wanting to apply this will need to implement the discovery and explanation loops for their specific architectures, build infrastructure to store and query activation patterns at scale, and establish what interpretability quality means for their specific use case before evaluating whether the investment is warranted.

Where to go
from here.

Five steps for going deeper into this work or starting to experiment yourself.

1
Read the paper
Marin-Llobet, A. & Ferrando, J. (2026). Automated Interpretability and Feature Discovery in Language Models with Agents. Harvard University. arXiv:2605.01555.
2
Start with Gemma-2 2B
If you want to explore the approach, Gemma-2 2B is publicly available through Google. You can compute MLP activations on a representative text corpus, build a co-activation graph by tracking which features activate together across inputs, and apply statistical separability metrics to score candidates. This is the discovery loop stripped to its essentials.
3
Build the hypothesis-testing loop
The explanation agent's core cycle is implementable with any capable language model: generate a contrastive prompt pair for a candidate feature, predict which will activate, observe the actual activation, and revise the hypothesis where the prediction failed. Start with manual verification on a small set of features to calibrate the loop before automating it.
4
Log the trace from the start
Structure every hypothesis, every contrastive test, and every observation as a persistent record before you build anything else. The trace is what makes the explanation auditable and the pipeline revisable. Building the logging layer in after the fact is much harder than designing for it from the beginning.
5
Compare against your current one-shot baseline
Before committing to the full agent-loop approach, run a one-shot baseline on the same set of features. The comparison tells you where iterative probing adds value on your specific models and tasks. On clean, well-separated features, one-shot may be sufficient. On features with secondary activation patterns, the iterative loop will produce meaningfully better descriptions.
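
For the baseline comparison in step 5, one simple approach is to score each explanation by how often its predictions match the feature's observed behavior on held-out prompts, then look at the per-feature gain. This is a minimal sketch under assumptions: the agreement metric, the function names, and the record layout are hypothetical choices, not the paper's evaluation, and the values below are placeholders rather than results.

```python
def agreement(predicted, observed):
    """Fraction of held-out prompts where the explanation's prediction matched the feature."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("predicted and observed must be non-empty and equal in length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def compare(features):
    """Return (feature_id, one_shot_score, iterative_score, gain), largest gain first.

    features: feature_id -> {"observed": [...], "one_shot": [...], "iterative": [...]}
    where each list holds one boolean per held-out prompt.
    """
    rows = []
    for fid, rec in features.items():
        base = agreement(rec["one_shot"], rec["observed"])
        loop = agreement(rec["iterative"], rec["observed"])
        rows.append((fid, base, loop, loop - base))
    return sorted(rows, key=lambda r: r[3], reverse=True)


# Placeholder data for two hypothetical features (not measurements).
features = {
    "feat_a": {"observed":  [True, True, False, False],
               "one_shot":  [True, False, False, False],   # misses a secondary pattern
               "iterative": [True, True, False, False]},
    "feat_b": {"observed":  [True, False, True, False],
               "one_shot":  [True, False, True, False],    # clean feature: one-shot suffices
               "iterative": [True, False, True, False]},
}
rows = compare(features)
```

Features with a gain near zero are candidates for staying on the cheaper one-shot path; features with a large gain are where the iterative loop is earning its cost.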