First surfaced in Tandemly Briefing — 2026-05-14.
Quiet reasoning,
fewer tokens.
A new method lets language models do their exploratory thinking in compressed internal representations before switching to visible text. On hard math benchmarks, this cuts tokens by 32% and improves accuracy at the same time. No fine-tuning required.
Chain-of-thought
is expensive by design.
Language models reason by generating long text chains. The same mechanism that makes them accurate makes them costly. Cutting tokens hurts quality. More tokens cost money. Until now, there was no clean exit from that tradeoff.
When a language model works through a hard problem, it reasons out loud. Each step in the chain is a sentence, a calculation, a hypothesis. The model writes down its thinking, then continues from where it left off. This is the basis of chain-of-thought reasoning, and it works well. On difficult math problems and complex reasoning tasks, generating a long reasoning chain before answering consistently outperforms generating an answer directly.
The problem is that the chains get long. A single AIME math problem can require 15,000 tokens of reasoning before the model commits to an answer. Many of those tokens are productive. Some are not. The model writes out paths it ultimately abandons, explores directions it eventually rejects, and sometimes restates the same idea in slightly different words before moving on.
The standard response is to budget the tokens. Give the model a limit and tell it to work within that limit. But token budgets are blunt instruments. When the limit runs out, the model stops, whether it has found a good answer or not. You save tokens and lose accuracy. The tradeoff felt like a fixed constraint: you get the quality the tokens buy you, and that is that.
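To make the bluntness of a hard budget concrete, here is a minimal sketch of a budgeted decode loop. The names `generate_with_budget` and `toy_model`, and the `<answer>` marker, are hypothetical stand-ins for illustration, not part of any real inference API.

```python
def generate_with_budget(next_token, budget, stop_token="<answer>"):
    """Generate until the model answers or the token budget runs out."""
    tokens = []
    for _ in range(budget):
        token = next_token(tokens)
        tokens.append(token)
        if token == stop_token:
            return tokens, True   # the model finished within budget
    return tokens, False          # budget exhausted mid-reasoning


def toy_model(tokens):
    # Hypothetical model that needs 5 reasoning steps before answering.
    return "<answer>" if len(tokens) >= 5 else "step"


finished_tokens, finished = generate_with_budget(toy_model, budget=10)
cut_tokens, cut_finished = generate_with_budget(toy_model, budget=4)
```

With the smaller budget the loop returns fewer tokens but no answer. The cutoff has no notion of how close the model was to finishing.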
Behind that tradeoff sits an assumption: reasoning must happen in visible text. If the model is going to explore a path, it has to write that exploration down as tokens, even if most of those tokens are intermediate scaffolding the model does not actually need. LaTER asks: what if the early exploration could happen internally, in a form that never needs to be decoded at all?
Two phases,
one model.
LaTER splits the reasoning process into an exploration phase and a verification phase. The first is quiet and cheap. The second is precise and explicit. The model decides when to switch.
The key insight is that language models do not actually need to decode their reasoning into text to continue reasoning. Internally, every step is a vector of numbers, a hidden state. Normally, the model decodes that hidden state into a token, writes the token to the context, and re-encodes it on the next step. That decode-then-re-encode cycle is where a lot of inference cost lives.
LaTER skips that cycle during the exploratory phase. Instead of decoding the hidden state into a token, it projects the final-layer hidden state back to the input embedding space and feeds it directly into the next step. The model is reasoning in the continuous space of its own internal representations. No text is generated. The key-value cache from the latent phase is preserved across this entire stretch, so no computation is wasted when the model eventually switches to explicit text.
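The difference between an explicit step and a latent step can be sketched with a toy model. Everything here is an assumption made for illustration: the three-token vocabulary, the `forward` function, and above all the projection, which is written as a probability-weighted average of input embeddings. The source does not specify how LaTER projects hidden states, so treat this as one possible choice rather than the method itself.

```python
import math

# Toy stand-ins (all hypothetical): a 3-token vocabulary with 2-d embeddings.
EMBED = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x):
    """Toy 'model': maps an input embedding to a final-layer hidden state."""
    return [0.5 * x[0] + 0.5 * x[1], x[0] - x[1]]


def logits_from_hidden(h):
    """Toy output head: dot product of the hidden state with each embedding."""
    return [sum(hi * ei for hi, ei in zip(h, e)) for e in EMBED]


def explicit_step(x):
    """Standard decoding: hidden state -> token id -> that token's embedding."""
    logits = logits_from_hidden(forward(x))
    token = max(range(len(logits)), key=logits.__getitem__)
    return token, EMBED[token]


def latent_step(x):
    """Latent step: no token is emitted; the next input is a continuous vector.

    Assumed projection: a probability-weighted mix of input embeddings.
    """
    probs = softmax(logits_from_hidden(forward(x)))
    dim = len(EMBED[0])
    return [sum(p * e[d] for p, e in zip(probs, EMBED)) for d in range(dim)]


# Explore silently for a few steps, then switch to visible tokens.
x = EMBED[0]
for _ in range(3):
    x = latent_step(x)
token, x = explicit_step(x)
```

The explicit step collapses the distribution to a single token before continuing, while the latent step carries the whole distribution forward as a vector. A real implementation would also keep the key-value cache across both phases, which this toy omits.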
The base version of LaTER requires no changes to model weights. The projection from hidden states back to embedding space uses the existing model architecture. The entropy and stop-token probes that decide when to switch read only from the model's existing outputs. The entire method is a wrapper around standard inference: no new training, no new parameters.
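A switching rule built from those two probes might look like the sketch below. The function names, the threshold values, and the direction of each comparison (switch on low entropy or on high stop-token probability) are assumptions for illustration; the source only says that both signals are read from the model's existing outputs.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_switch(probs, stop_id, max_entropy=0.5, min_stop_prob=0.3):
    """Return True when the latent phase should hand over to explicit text.

    Hypothetical rule: switch once the distribution is confident (low
    entropy) or puts enough mass on a designated stop token.
    """
    return entropy(probs) <= max_entropy or probs[stop_id] >= min_stop_prob


# A flat distribution keeps exploring; a peaked one triggers the switch.
keep_exploring = should_switch([0.25, 0.25, 0.25, 0.25], stop_id=3)
ready = should_switch([0.97, 0.01, 0.01, 0.01], stop_id=3)
```

Both probes operate on a probability vector the model already produces at every step, which is why no extra parameters are needed.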
A fine-tuned variant exists and performs better. But the no-finetuning result is the one that matters most for immediate production use: plug in, measure, decide.
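For the "measure" step, a small comparison helper is often enough. This is a sketch under assumed inputs: each run is a list of `(is_correct, tokens_used)` pairs, one per problem, and the name `compare_runs` is hypothetical.

```python
def compare_runs(baseline, candidate):
    """Compare two evaluation runs over the same problem set.

    Each run is a list of (is_correct, tokens_used) pairs, one per problem.
    """
    def summarize(run):
        n = len(run)
        accuracy = sum(1 for ok, _ in run if ok) / n
        avg_tokens = sum(tokens for _, tokens in run) / n
        return accuracy, avg_tokens

    base_acc, base_tok = summarize(baseline)
    cand_acc, cand_tok = summarize(candidate)
    return {
        # Positive means the candidate used fewer tokens on average.
        "token_reduction_pct": 100.0 * (base_tok - cand_tok) / base_tok,
        # Positive means the candidate answered more problems correctly.
        "accuracy_delta_pts": 100.0 * (cand_acc - base_acc),
    }
```

Running the same problems through standard decoding and through the wrapper, then passing both runs to this helper, gives the two numbers the decision rests on.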
Fewer tokens.
Better answers.
The headline result is the accuracy improvement alongside the cost reduction. Normally those move in opposite directions. On AIME 2025, they did not.
The initial assumption would be that latent exploration is just cutting the exploratory tails from chains of thought, leaving the same core reasoning intact. The accuracy improvement suggests something more interesting: the latent phase may be doing exploration that is genuinely productive. The model arrives at the explicit verification phase already oriented, and the verification is tighter as a result. Verbose chains include noise, not just signal.
AIME 2025 is the strongest result, but the method holds across other benchmarks at 16–32% token reduction with matched or improved accuracy. The range reflects how much of a given chain-of-thought is exploratory versus verification-focused, which varies by problem type. The entropy and stop-token thresholds can be tuned per task to target a specific tradeoff.
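Per-task tuning could be as simple as sweeping a threshold on a validation set and keeping the cheapest setting that does not lose accuracy. The record format, the function name `pick_threshold`, and the numbers in the example sweep are made-up placeholders, not measurements from the paper.

```python
def pick_threshold(sweep, baseline_accuracy):
    """Pick the sweep entry with the fewest tokens that keeps accuracy.

    `sweep` is a list of dicts with keys "threshold", "accuracy" and
    "avg_tokens" (a hypothetical record format). Returns None if no
    entry matches or beats the baseline accuracy.
    """
    acceptable = [r for r in sweep if r["accuracy"] >= baseline_accuracy]
    if not acceptable:
        return None
    return min(acceptable, key=lambda r: r["avg_tokens"])


# Placeholder sweep results for one task (illustrative values only).
example_sweep = [
    {"threshold": 0.2, "accuracy": 0.70, "avg_tokens": 9000},
    {"threshold": 0.5, "accuracy": 0.72, "avg_tokens": 8000},
    {"threshold": 0.8, "accuracy": 0.65, "avg_tokens": 6000},
]
best = pick_threshold(example_sweep, baseline_accuracy=0.70)
```

A different selection rule, such as accepting a small accuracy loss for larger savings, would target a different point on the same tradeoff.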
The fine-tuned variant reaches 80.0% on AIME 2025 with 33% fewer tokens than the standard supervised fine-tuning baseline. This suggests the model can learn to use the latent phase more deliberately when given a training signal. The training-free version is the floor, not the ceiling.
Results are reported on Qwen3-14B. Generalization across model families, sizes, and architectures requires further testing. The latent phase relies on the model reasoning effectively in its own representation space, which may behave differently across architectures. The switching thresholds need per-task calibration, which adds implementation overhead that a simple token-budget approach does not have.
What this means
for builders.
The token savings are real and immediate. The accuracy improvement is the more important signal: it suggests that verbose chain-of-thought includes noise the model does not need, and that noise can be filtered at the source.