A good interpretation tells you what to do next.
First surfaced in Tandemly Briefing — 2026-05-25.
A team of leading interpretability researchers published a position paper arguing that the field has been measuring itself against the wrong standard. The question is not whether an explanation is internally elegant. It is whether the explanation points to a specific intervention, and whether that intervention has been shown to work.
Rich theories, limited traction.
The interpretability field has built sophisticated tools for understanding what happens inside AI models. The gap between understanding and acting on that understanding has received far less attention.
Interpretability researchers have spent years developing techniques to look inside language models and neural networks. Which circuits handle factual recall? Which attention heads track syntactic relationships? Which neurons activate on specific concepts? The work is technically careful, often surprising, and has produced a genuine body of knowledge about how models process information.
But there is a gap between having an interpretation and being able to do something with it. If a researcher shows that a set of neurons activates on gender-stereotyped associations, the useful follow-up question is: what should I change, and does changing it actually fix the problem? An interpretation that cannot answer those questions is interesting. It is not yet a tool.
The problem this position paper names is that the field has largely evaluated interpretability research on a different standard: whether an explanation is internally coherent, predictive of some behavioral signal, or elegant in its account of what the model does. That is a real scientific standard. But it is not sufficient to make a finding actionable, and the authors argue that the field has built most of its reward structure around explanatory quality without enough pressure on whether a finding can be acted on.
Can the person who reads this interpretability finding write down a specific thing to change in the model? And does making that change produce the expected effect without breaking something else? If neither question has a clear answer, the finding belongs in the exploratory tier, not the actionable tier.
Two dimensions, one rubric.
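The paper proposes a two-dimension rubric for evaluating any interpretability finding. The first dimension is concreteness: does the finding point to a specific intervention? The second is validation: has that intervention been shown to work? The rubric applies regardless of the method, model, or domain.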
The rubric creates four quadrants. Low concreteness and low validation is exploratory research: valuable for building understanding but not a basis for production decisions. High concreteness and low validation is a proposed intervention awaiting a test. Low concreteness and high validation is a finding that predicts some behavioral signal in a controlled setting but is too abstract to propose a specific action. High concreteness and high validation is genuinely actionable: a finding that tells you what to change and shows that the change works.
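As a rough illustration, the quadrants can be written as a small lookup. This is a sketch, not code from the paper: the `Finding` type, the reduction of each dimension to a yes/no flag, and the quadrant labels are all assumptions made here for illustration. The label for the low-concreteness, high-validation case paraphrases the article's description of work that is validated narrowly but too abstract to act on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical record of an interpretability finding (illustrative only)."""
    name: str
    concrete: bool   # does it name a specific thing to change in the model?
    validated: bool  # has that change been shown to work as expected?


def quadrant(finding: Finding) -> str:
    """Place a finding in one of the four rubric quadrants.

    The returned labels are paraphrases, not terms defined by the paper.
    """
    if finding.concrete and finding.validated:
        return "actionable"
    if finding.concrete:
        return "proposed intervention awaiting a test"
    if finding.validated:
        return "validated but too abstract to act on"
    return "exploratory"


# Example: a specific intervention has been proposed but not yet tested.
untested = Finding(name="ablate candidate heads", concrete=True, validated=False)
print(quadrant(untested))
```

In practice both dimensions are likely matters of degree rather than yes/no flags; the booleans here only keep the quadrant logic easy to read.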
The paper also identifies five domains where interpretability that clears both criteria provides unique leverage beyond what behavioral testing can deliver. Behavioral testing tells you how a model performs on a distribution of tasks. It cannot easily tell you why a specific failure pattern is happening internally or how to fix it at the root. When interpretability is both concrete and validated, it can support root-cause interventions that behavioral benchmarking cannot.
Behavioral evaluation measures whether the model produces the right outputs on a distribution of inputs. It can confirm that a problem exists and whether a proposed fix changed the measured outputs. But it cannot tell you which internal mechanism caused a failure, making it hard to know whether a fix addresses the root cause or patches the symptom. Interpretability, when it clears the concreteness-validation bar, can bridge that gap in specific domains.
The authors are not arguing that interpretability replaces behavioral evaluation. They are arguing that in domains where root-cause intervention matters, interpretability that meets both criteria provides something behavioral testing cannot.
Most work falls short on at least one dimension.
The authors survey the interpretability literature and apply their rubric. The pattern they find is consistent: the field has optimized more for concreteness than for validation.
This is a position paper, which means the evidence comes from literature review and argument rather than from a new controlled experiment. The authors read existing interpretability work through the lens of their two-dimension rubric and report what they find.
The pattern, they argue, is that the field has produced more work that is concrete than work that is validated. Researchers have become skilled at identifying specific components, directions, or circuits inside models that correlate with specific behaviors. That is the concreteness criterion. What has received less consistent attention is whether proposed interventions on those components actually produce the expected change in real-world use, without degrading other model capabilities. That is the validation criterion.
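The validation criterion described above has two parts: the intervention must produce the expected change, and it must not degrade other capabilities. A minimal sketch of that check follows, assuming every score is a higher-is-better number measured before and after the intervention. The function name, the thresholds, and the metric names are hypothetical and do not come from the paper.

```python
from typing import Mapping


def passes_validation(
    target_before: float,
    target_after: float,
    other_before: Mapping[str, float],
    other_after: Mapping[str, float],
    min_improvement: float = 0.05,   # assumed threshold, not from the paper
    max_regression: float = 0.01,    # assumed threshold, not from the paper
) -> bool:
    """Return True if the target metric improved enough and no other
    capability regressed by more than the allowed margin."""
    improved = (target_after - target_before) >= min_improvement
    preserved = all(
        other_before[name] - other_after[name] <= max_regression
        for name in other_before
    )
    return improved and preserved


# Example: the target behavior improves, and unrelated scores barely move.
ok = passes_validation(
    target_before=0.70,
    target_after=0.80,
    other_before={"summarization": 0.90, "arithmetic": 0.85},
    other_after={"summarization": 0.895, "arithmetic": 0.85},
)
print(ok)
```

A real validation would also need the realistic use-case distribution the article mentions; this sketch only shows the shape of the two-part test.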
Some work is concrete but unvalidated: the proposed intervention is specific, but its downstream effects have not been measured across a realistic use-case distribution. Other work is validated in a narrow sense (it predicts some behavioral signal in a controlled setting) but too abstract to propose a specific real-world action. The paper argues the field needs both, and that grading work on only the first dimension has created a gap between how interpretability is evaluated in research and how it would need to perform to be trusted in production.
The paper identifies five domains where actionable interpretability, meeting both concreteness and validation criteria, provides something behavioral testing alone cannot. The common thread across these domains is that root-cause diagnosis matters: knowing that a failure exists is not sufficient; you need to know what internal structure produced it and how to change it. The specific domains are detailed in the full paper at arXiv:2605.11161.
If the field grades interpretability work primarily on concreteness and elegance of explanation, that is what gets produced. The position paper is an argument for adding validation as an explicit criterion in how interpretability research is evaluated, published, and adopted. The goal is not to raise the bar arbitrarily but to align the field's internal standards with what makes an interpretability finding safe to act on.
The authors are explicit that this is a call to the community rather than a statement about what any individual paper should have done differently. The argument is about what the field should optimize for going forward.
This synthesis is based on the daily briefing summary of the paper, since the arXiv page was not accessible at the time of writing. The five specific domains, detailed rubric examples, and full literature survey are in the paper itself. The core argument and two-dimension rubric are accurately represented here, but the complete paper may contain additional nuance.
What this means for teams building with AI.
The rubric applies beyond interpretability researchers. Anyone deciding whether to act on an interpretability finding, commission an audit, or invest in an interpretability tool can use concreteness and validation as a baseline check.
Where to go from here.
The rubric is simple enough to apply without reading the full paper. But the paper itself is a position piece from active researchers across the field and is worth reading directly.