When interpretability runs itself.
First surfaced in Tandemly Briefing — 2026-05-21.
The work of reverse-engineering what language model features actually detect has always been a manual, slow, human-bottlenecked process. Two researchers at Harvard changed the setup: one agent loop navigates activation space to find what is worth examining; a second probes those features with targeted tests and refines its hypotheses until they hold up. The result treats interpretability as a workflow rather than a dashboard.
Interpretability doesn't scale by hand.
Mechanistic interpretability, the project of figuring out what individual neurons and features inside a neural network have learned to detect, has a throughput problem. Modern language models contain millions of features. Manual analysis handles dozens per week.
To understand what a particular feature inside a language model does, a researcher typically follows a slow cycle. Gather examples of text that cause the feature to activate. Look for a pattern. Form a hypothesis about what it detects. Design a new experiment that would falsify the hypothesis if it were wrong. Run the experiment. Observe whether the feature behaves as predicted. Revise. Repeat. This is careful, rigorous work. It is also extremely slow and scales with neither the number of features in modern models nor the size of the interpretability research community.
Automatic interpretability tools exist to accelerate the process. The dominant approach today is one-shot: collect a batch of text examples that activate the feature, feed them to a language model, and ask it to describe what they have in common. This is faster than manual analysis. But it has a structural flaw that limits its reliability. It forms one hypothesis and never tests it. If the initial description is wrong, off-target, or incomplete, there is no mechanism to catch and correct the error. The system produces an explanation and stops, with no record of how that explanation was reached and no way to know whether it would survive a challenging follow-up test.
What the field was missing was something closer to the actual scientific loop: propose a hypothesis, design a test that would challenge it, run the test, see what the evidence says, and revise accordingly. One-shot auto-interp skips the test.
What happens when you replace the one-shot explanation with a full agent loop that proposes hypotheses, designs and runs targeted tests, observes the results, and iterates? And separately: can you automate the upstream problem of deciding which features are worth examining in the first place?
Two loops, one pipeline.
Marin-Llobet and Ferrando built two coupled agent loops. The first decides what to examine. The second figures out what it means. Neither requires a human to intervene between steps.
The two loops are designed to work together, each handling a distinct part of the problem that manual interpretability researchers face. The discovery loop addresses selection: out of millions of features, which ones are interpretable enough to be worth studying? The explanation loop addresses understanding: given a feature flagged as worth examining, what does it actually detect?
The discovery agent's k-NN graph is the key structural move on the selection problem. In a model with millions of features, the naive approach is to examine features by activation frequency or at random. Neither strategy systematically finds features that are likely to be interpretable. A graph of co-activation relationships, combined with a statistical measure of separability, tells the agent something different: which features have activation patterns clean enough that a targeted test can actually distinguish what activates them from what does not.
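The selection idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: cosine similarity between activation profiles stands in for "co-activation", an AUROC-style score stands in for the "statistical measure of separability", and the names `knn_graph`, `separability`, and the toy activation matrix are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two activation profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_graph(acts, k):
    """Map each feature index to the indices of its k most similar features.

    acts[i][t] is the activation of feature i on token t (toy data here).
    """
    graph = {}
    for i, u in enumerate(acts):
        sims = [(cosine(u, v), j) for j, v in enumerate(acts) if j != i]
        sims.sort(key=lambda s: (-s[0], s[1]))  # most similar first, ties by index
        graph[i] = [j for _, j in sims[:k]]
    return graph

def separability(pos, neg):
    """AUROC-style score: the chance that an activating example outscores a
    non-activating one (ties count half). 1.0 means a clean split."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy activation matrix: 4 features observed on 6 tokens (assumed values).
acts = [
    [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.9, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.8, 1.0, 0.1],
]

graph = knn_graph(acts, k=1)
clean = separability([0.9, 0.8, 0.7], [0.1, 0.2, 0.0])
noisy = separability([0.9, 0.1], [0.5, 0.0])
print(graph, clean, noisy)
```

In this toy setup, features 0 and 1 end up as each other's nearest neighbors, as do features 2 and 3. A discovery agent along these lines could walk the graph and prioritize features with scores near 1.0, since those are the ones a contrastive test can most cleanly confirm or refute.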
The explanation agent's loop is the key structural move on the explanation problem. Its adversarial design is what gives it leverage over one-shot methods. Instead of describing a batch of activating examples and stopping, the agent makes a prediction, tests it, and responds to the evidence. A one-shot approach can confidently describe a feature as detecting "references to royalty" and never learn that it also activates on formal titles in business contexts. The explanation agent's loop catches that: a test built on formal business language would contradict the prediction and trigger a revision of the hypothesis.
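The royalty example can be turned into a toy version of the predict, test, revise cycle. This is a hedged sketch only: in the system described, a language model would propose the hypotheses and design the probes, whereas here both are fixed lists, and `feature_fires` is a hypothetical stand-in for reading a real feature's activation.

```python
ROYAL = {"king", "queen", "prince"}
BUSINESS = {"chairman", "director", "president"}

def feature_fires(text):
    """Toy stand-in for a real feature: fires on any formal title."""
    return bool(set(text.lower().split()) & (ROYAL | BUSINESS))

# Candidate explanations in the order the agent would try them (assumed).
HYPOTHESES = [
    ("references to royalty", ROYAL),
    ("formal titles, royal or corporate", ROYAL | BUSINESS),
]

# Contrastive probes: one royal, one corporate, one neutral control.
PROBES = [
    "the queen addressed parliament",
    "the chairman opened the meeting",
    "the dog chased the ball",
]

def explain(hypotheses, probes):
    """Try each hypothesis against every probe; accept the first one whose
    predictions match the observed activations on all probes."""
    trace = []
    for description, keywords in hypotheses:
        failures = []
        for text in probes:
            predicted = bool(set(text.lower().split()) & keywords)
            if predicted != feature_fires(text):
                failures.append(text)
        trace.append({"hypothesis": description, "failed_probes": failures})
        if not failures:
            return description, trace
    return None, trace  # no hypothesis survived the probes

final, trace = explain(HYPOTHESES, PROBES)
print(final)
for step in trace:
    print(step)
```

Here the first hypothesis fails on the corporate probe, which is exactly the case a one-shot description would never encounter, and the second hypothesis survives all three probes.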
Both loops produce auditable traces. Every hypothesis considered, every test designed, every test result, and every revision is logged as a structured record. The pipeline is autonomous, but its reasoning is not opaque.
A one-shot explanation tells you what the system concluded. An auditable trace tells you how: which hypotheses were live at each step, which test was designed to challenge them, what the model actually did when given the contrastive inputs, and how the hypothesis changed in response. The trace is what makes an automated explanation revisable and contestable in a way that a single-shot description is not.
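One way to picture such a trace is as a list of structured records, one per probing step. The schema below is an assumption made for illustration: the field names mirror the four things the text says a trace captures, but the source does not specify the actual format, and the activation values are invented.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TraceStep:
    """One step of a probing trace (hypothetical schema)."""
    step: int
    live_hypotheses: list        # hypotheses under consideration at this step
    test_inputs: list            # contrastive inputs designed to challenge them
    observed_activations: list   # what the feature did on each input
    revised_hypothesis: str      # the hypothesis carried into the next step

def dump_trace(steps):
    """Serialize a trace as JSON Lines so each step can be audited alone."""
    return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in steps)

def load_trace(text):
    """Rebuild TraceStep records from the JSON Lines form."""
    return [TraceStep(**json.loads(line)) for line in text.splitlines()]

steps = [
    TraceStep(
        step=1,
        live_hypotheses=["references to royalty"],
        test_inputs=["the queen addressed parliament",
                     "the chairman opened the meeting"],
        observed_activations=[0.9, 0.8],  # illustrative values only
        revised_hypothesis="formal titles, royal or corporate",
    ),
]

serialized = dump_trace(steps)
restored = load_trace(serialized)
print(serialized)
```

Keeping one self-contained record per step is what would let a reviewer reopen a single narrow test and re-run it without replaying the whole loop.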
Better discovery, more precise explanations.
The system was tested on the Gemma-2 family of language models and on weight-sparse transformer MLP neurons. In both settings, the agent-loop approach outperformed one-shot auto-interp baselines on interpretability quality metrics.
The performance improvement came from two sources, not one. The discovery agent's graph-based selection with statistical separability surfaced more interpretable features than frequency-based or random selection: starting from better candidates meant the explanation loop had more tractable features to work with. The explanation loop then produced more precise explanations on those candidates than a single-shot description would have. Removing either loop degraded results, establishing that the two components are not redundant.
The explanation agent's loop caught cases where the initial hypothesis was partially correct but missed a secondary activation pattern. In several documented examples, the first-round hypothesis accurately described the most prominent activation context, but the contrastive tests revealed that the feature also activated in a related, distinct context the initial description had not named. These refinements cannot appear in one-shot output, because a one-shot method has no second round. The iterative loop did not always change the explanation, but when it did, the revision was meaningful.
Researchers reviewing the auditable traces could identify cases where the final explanation was technically accurate but had missed a nuance that only appeared later in the probing record. The trace made this visible: the missing hypothesis was there in the log, was briefly considered, but was ruled out by a test that, in hindsight, was designed too narrowly. This is the kind of failure that is invisible in one-shot output and visible in an audit trail. It is also the kind of failure a researcher can correct without re-running the full loop from scratch.
This synthesis is based on the paper's documented findings and editorial notes. The full paper was not retrievable at time of synthesis due to access limitations; specific precision, recall, or agreement metrics are available in the paper directly at arxiv.org/abs/2605.01555. The qualitative findings and methodological structure described here are drawn from the documented record of the paper's approach and conclusions.
What this means for building with AI.
The framing shift here is from interpretability as a read-only dashboard to interpretability as a testable, iterative workflow. That distinction matters for anyone who needs to make defensible claims about what a model has learned.
Where to go from here.
If you want to go deeper into this work or start experimenting yourself, the paper at arxiv.org/abs/2605.01555 is the place to start.