Interpretability · Research Methods

A good interpretation
tells you what to do next.

First surfaced in Tandemly Briefing — 2026-05-25.

A team of leading interpretability researchers published a position paper arguing that the field has been measuring itself against the wrong standard. The question is not whether an explanation is internally elegant. It is whether the explanation points to a specific intervention, and whether that intervention has been shown to work.

The rubric
Concreteness: does the finding specify what to change? Validation: has that change been tested and confirmed to work? A finding that cannot clear both bars is exploratory, not actionable.

Rich theories,
limited traction.

The interpretability field has built sophisticated tools for understanding what happens inside AI models. The gap between understanding and acting on that understanding has received far less attention.

Interpretability researchers have spent years developing techniques to look inside language models and neural networks. Which circuits handle factual recall? Which attention heads track syntactic relationships? Which neurons activate on specific concepts? The work is technically careful, often surprising, and has produced a genuine body of knowledge about how models process information.

But there is a gap between having an interpretation and being able to do something with it. If a researcher shows that a set of neurons activates on gender-stereotyped associations, the useful follow-up question is: what should I change, and does changing it actually fix the problem? An interpretation that cannot answer those questions is interesting. It is not yet a tool.

The problem this position paper names is that the field has largely been evaluating interpretability research on a different standard: whether an explanation is internally coherent, predictive of some behavioral signal, or elegant in its account of what the model does. That is a real scientific standard. But it is not sufficient to make a finding actionable, and the authors argue that the field has built most of its reward structure around the first standard without enough pressure on the second.

The central question

Can the person who reads this interpretability finding write down a specific thing to change in the model? And does making that change produce the expected effect without breaking something else? If either question lacks a clear answer, the finding belongs in the exploratory tier, not the actionable tier.

Two dimensions,
one rubric.

The paper proposes a two-dimension rubric for evaluating any interpretability finding. The rubric applies regardless of the method, model, or domain.

Dimension 1: Concreteness
How specific is the intervention the finding enables? A finding that says "this model uses shortcut features" is low on concreteness. A finding that says "activations in layer 14 encode a specific bias direction that can be identified with probe P, and ablating that direction changes the model's outputs on task T" is high on concreteness. The more precisely the finding identifies what to modify, the more it can guide action.
Dimension 2: Validation
Has the intervention been tested and confirmed to work? A finding can propose a very specific intervention and still be unvalidated if nobody has run the modification and measured the result. Validation requires showing that the proposed change produces the expected behavioral outcome without degrading the model's other capabilities. Prediction alone is not validation.
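The layer-14 example under Dimension 1 mentions ablating a direction. One common way to do that is to project the direction out of an activation vector. The sketch below is a minimal illustration under that assumption, not the paper's method: the function names, the toy 3-dimensional activation, and the `bias_direction` vector are all hypothetical, and a real intervention would operate on a model's hidden states rather than a plain list.

```python
import math


def normalize(vector):
    """Return `vector` scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vector]


def ablate_direction(activation, direction):
    """Remove the component of `activation` that lies along `direction`.

    Computes a' = a - (a . d) d, where d is the unit-length direction.
    """
    d = normalize(direction)
    coeff = sum(a * x for a, x in zip(activation, d))
    return [a - coeff * x for a, x in zip(activation, d)]


# Toy values (hypothetical): a 3-dimensional activation and a probe direction.
activation = [2.0, 1.0, -1.0]
bias_direction = [1.0, 0.0, 0.0]

ablated = ablate_direction(activation, bias_direction)
print(ablated)
```

Writing the intervention down this precisely is what makes a finding concrete. Whether the ablation changes outputs on task T as expected, and leaves other capabilities intact, is the separate validation question.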

The rubric creates four quadrants. Low concreteness and low validation is exploratory research: valuable for building understanding but not a basis for production decisions. High concreteness and low validation is a proposed intervention awaiting a test. Low concreteness and high validation is a result that holds up in a controlled setting but is too abstract to specify what to change. High concreteness and high validation is genuinely actionable: a finding that tells you what to change and shows that the change works.
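The quadrants can be sketched as a small triage helper. This is an illustrative simplification, not something the paper provides: it treats each dimension as a yes/no judgment, although the rubric may well be graded. The `Finding` class, the `tier` function, and the label strings are hypothetical, and the label for the low-concreteness, high-validation case is an assumption based on the article's later description of work that is validated narrowly but too abstract to act on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    name: str
    concrete: bool   # does the finding specify what to change?
    validated: bool  # has that change been tested and confirmed to work?


def tier(finding: Finding) -> str:
    """Map a finding to one of the four quadrants of the rubric."""
    if finding.concrete and finding.validated:
        return "actionable"
    if finding.concrete:
        return "proposed intervention awaiting a test"
    if finding.validated:
        # Assumed label: holds up in testing but does not say what to change.
        return "validated but too abstract to act on"
    return "exploratory"


# Hypothetical findings, for illustration only.
findings = [
    Finding("model uses shortcut features", concrete=False, validated=False),
    Finding("layer-14 direction, ablation untested", concrete=True, validated=False),
    Finding("layer-14 direction, ablation confirmed", concrete=True, validated=True),
]

for f in findings:
    print(f"{f.name}: {tier(f)}")
```

Only the first branch counts as actionable. Under the article's stricter summary, everything else is treated as exploratory when deciding what to do in production.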

The paper also identifies five domains where interpretability that clears both criteria provides unique leverage beyond what behavioral testing can deliver. Behavioral testing tells you how a model performs on a distribution of tasks. It cannot easily tell you why a specific failure pattern is happening internally or how to fix it at the root. When interpretability is both concrete and validated, it can support root-cause interventions that behavioral benchmarking cannot.

Why behavioral testing alone is not enough

Behavioral evaluation measures whether the model produces the right outputs on a distribution of inputs. It can confirm that a problem exists and whether a proposed fix changed the measured outputs. But it cannot tell you which internal mechanism caused a failure, making it hard to know whether a fix addresses the root cause or patches the symptom. Interpretability, when it clears the concreteness-validation bar, can bridge that gap in specific domains.

The authors are not arguing that interpretability replaces behavioral evaluation. They are arguing that in domains where root-cause intervention matters, interpretability that meets both criteria provides something behavioral testing cannot.

Most work falls short
on at least one dimension.

The authors survey the interpretability literature and apply their rubric. The pattern they find is consistent: the field has optimized more for concreteness than for validation.

This is a position paper, which means the evidence comes from literature review and argument rather than from a new controlled experiment. The authors read existing interpretability work through the lens of their two-dimension rubric and report what they find.

The pattern, they argue, is that the field has produced more work that is concrete than work that is validated. Researchers have become skilled at identifying specific components, directions, or circuits inside models that correlate with specific behaviors. That is the concreteness criterion. What has received less consistent attention is whether proposed interventions on those components actually produce the expected change in real-world use, without degrading other model capabilities. That is the validation criterion.
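The validation criterion can be made concrete as a before-and-after comparison. The sketch below is one possible shape for such a check, not a procedure taken from the paper: the function name, the metric names, the scores, and the thresholds `min_gain` and `max_drop` are all hypothetical, and it assumes every metric is a score where higher is better.

```python
def validate_intervention(before, after, target, min_gain=0.05, max_drop=0.01):
    """Check that an intervention helps the target metric without hurting others.

    `before` and `after` map metric names to scores (higher is better).
    The intervention passes only if the target improves by at least
    `min_gain` and no other metric falls by more than `max_drop`.
    """
    gain = after[target] - before[target]
    regressions = {
        name: before[name] - after[name]
        for name in before
        if name != target and before[name] - after[name] > max_drop
    }
    return {
        "target_gain": gain,
        "regressions": regressions,
        "validated": gain >= min_gain and not regressions,
    }


# Hypothetical scores measured before and after an intervention.
before = {"bias_benchmark": 0.60, "qa": 0.80, "summarization": 0.70}
after = {"bias_benchmark": 0.75, "qa": 0.80, "summarization": 0.62}

result = validate_intervention(before, after, target="bias_benchmark")
print(result["validated"], sorted(result["regressions"]))
```

In this toy case the target metric improves but summarization degrades, so the intervention would not count as validated. The choice of metrics and thresholds is a judgment call that depends on the realistic use-case distribution the article refers to.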

Some work is concrete but unvalidated: the proposed intervention is specific, but its downstream effects have not been measured across a realistic use-case distribution. Other work is validated in a narrow sense (it predicts some behavioral signal in a controlled setting) but too abstract to propose a specific real-world action. The paper argues the field needs both, and that grading work on only the first dimension has created a gap between how interpretability is evaluated in research and how it would need to perform to be trusted in production.

Five domains with unique leverage

The paper identifies five domains where actionable interpretability, meeting both concreteness and validation criteria, provides something behavioral testing alone cannot. The common thread across these domains is that root-cause diagnosis matters: knowing that a failure exists is not sufficient; you also need to know what internal structure produced it and how to change it. The specific domains are detailed in the full paper at arXiv:2605.11161.

The field-level implication

If the field grades interpretability work primarily on concreteness and elegance of explanation, that is what gets produced. The position paper is an argument for adding validation as an explicit criterion in how interpretability research is evaluated, published, and adopted. The goal is not to raise the bar arbitrarily but to align the field's internal standards with what makes an interpretability finding safe to act on.

The authors are explicit that this is a call to the community rather than a statement about what any individual paper should have done differently. The argument is about what the field should optimize for going forward.

Scope of this synthesis

This synthesis is based on the daily briefing summary of the paper, because the arXiv page was not accessible when it was written. The five specific domains, detailed rubric examples, and full literature survey are in the paper itself. The core argument and two-dimension rubric are accurately represented here, but the complete paper may contain additional nuance.

What this means
for teams building with AI.

The rubric applies beyond interpretability researchers. Anyone deciding whether to act on an interpretability finding, commission an audit, or invest in an interpretability tool can use concreteness and validation as a baseline check.

1. For AI practitioners reviewing interpretability findings
Before treating a result as actionable, apply the two-question test: what specific intervention does this finding enable, and has that intervention been validated to produce the expected change? If either answer is vague, treat the finding as exploratory research. Do not let it drive decisions about model behavior in production.
2. For AI safety teams
Behavioral testing can only catch failures in scenarios you have anticipated and can measure. The paper argues that interpretability, when it clears both criteria, can surface failure modes from the inside before they manifest externally. This is the kind of leverage the paper says behavioral testing cannot replicate.
3. For ML researchers doing interpretability work
The rubric is a direct challenge to how findings are scoped and reported. Showing that an interpretation is predictive of some behavioral signal is necessary but not sufficient. The additional ask is to specify the intervention the finding enables and to test whether that intervention works as intended across a realistic distribution. Publishing with both pieces of evidence substantially strengthens the case for adoption.
4. For business leaders commissioning interpretability audits
An audit that surfaces interesting internal patterns is different from an audit that delivers specific proposed interventions with validation evidence. When scoping an audit, require both: a description of the finding with a proposed intervention, and evidence that the intervention produces the expected behavioral change. Auditors who cannot deliver both are providing exploratory analysis, not an actionable report.

Where to go
from here.

The rubric is simple enough to apply without reading the full paper. But the paper itself is a position piece from active researchers across the field and is worth reading directly.

1. Read the paper
Orgad, Barez, Haklay, Lee, Mosbach, Reusch, Saphra, Wallace, Wiegreffe, Wong, Tenney & Geva (2026). Interpretability Can Be Actionable. arXiv:2605.11161. It is a position paper and accessible without deep technical background in mechanistic interpretability.
2. Apply the rubric to the next interpretability result you see
Whether you are reviewing a published paper, an internal audit report, or a vendor's interpretability dashboard, write down: what specific intervention does this finding enable? What evidence exists that the intervention works? Use the answers to categorize the finding as exploratory or actionable before deciding how to act on it.
3. Map the five domains to your own system
The paper names five domains where actionable interpretability has unique leverage over behavioral testing. Read those domains and identify which apply to your AI system or use case. For each match, ask whether your current safety or audit process would catch the failure modes that actionable interpretability could surface.
4. Follow the authors' subsequent work
This position paper is a call to action for the interpretability research community. The authors represent a broad cross-section of active interpretability researchers. Follow their ongoing work to see which of the five domains attracts the first empirical follow-up that meets the paper's own bar for actionability.
5. Compare with the complementary automated interpretability synthesis
The Automated Interpretability synthesis on this site covers a technical method for doing interpretability at scale using agent loops. That paper is an example of the kind of technical work this position paper is challenging. Comparing the two clarifies the distinction between method development and the criteria for when a method's outputs are ready to act on.