Inference Optimization · Test-Time Reasoning

First surfaced in Tandemly Briefing — 2026-05-14.

Quiet reasoning,
fewer tokens.

A new method, LaTER, lets language models do their exploratory thinking in compressed internal representations before switching to visible text. On AIME 2025 with Qwen3-14B, this cuts tokens by 32% and raises accuracy from 70.0% to 73.3%. No fine-tuning required.

Core concept
Latent exploration: the model reasons in the continuous space of its own internal states, skipping the expensive decode-then-re-encode cycle, until it is ready to verify its answer in plain chain-of-thought.

Chain-of-thought
is expensive by design.

Language models reason by generating long text chains. The same mechanism that makes them accurate makes them costly. Cutting tokens hurts quality. More tokens cost money. No clean exit from that tradeoff existed.

When a language model works through a hard problem, it reasons out loud. Each step in the chain is a sentence, a calculation, a hypothesis. The model writes down its thinking, then continues from where it left off. This is the basis of chain-of-thought reasoning, and it works well. On difficult math problems and complex reasoning tasks, generating a long reasoning chain before answering consistently outperforms generating an answer directly.

The problem is that the chains get long. A single AIME math problem can require around 15,000 tokens of reasoning before the model commits to an answer. Many of those tokens are productive. Some are not. The model writes out paths it ultimately abandons, explores directions it eventually rejects, and sometimes restates the same idea in slightly different words before moving on.

The standard response is to budget the tokens. Give the model a limit and tell it to work within that limit. But token budgets are blunt instruments. When the limit runs out, the model stops, whether it has found a good answer or not. You save tokens and lose accuracy. The tradeoff felt like a fixed constraint: you get the quality the tokens buy you, and that is that.

The assumption this paper challenges

Reasoning must happen in visible text. If the model is going to explore a path, it has to write that exploration down as tokens, even if most of those tokens are intermediate scaffolding the model does not actually need. LaTER asks: what if the early exploration could happen internally, in a form that never needs to be decoded at all?

Two phases,
one model.

LaTER splits the reasoning process into an exploration phase and a verification phase. The first is quiet and cheap. The second is precise and explicit. The model decides when to switch.

The key insight is that language models do not actually need to decode their reasoning into text to continue reasoning. Internally, every step is a vector of numbers, a hidden state. Normally, the model decodes that hidden state into a token, writes the token to the context, and re-encodes it on the next step. That decode-then-re-encode cycle is where a lot of inference cost lives.

LaTER skips that cycle during the exploratory phase. Instead of decoding the hidden state into a token, it projects the final-layer hidden state back to the input embedding space and feeds it directly into the next step. The model is reasoning in the continuous space of its own internal representations. No text is generated. The key-value cache from the latent phase is preserved across this entire stretch, so no computation is wasted when the model eventually switches to explicit text.
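To make the contrast between a standard decoding step and a latent step concrete, here is a toy sketch in plain Python. It assumes the projection back to embedding space is a probability-weighted average of the input embeddings, which is one plausible training-free choice and is not confirmed by this summary. The function names, the three-token vocabulary, and the embedding values are all invented for illustration.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_step(logits, embedding_table):
    """Standard decoding: commit to one token, then look up its input embedding."""
    token_id = max(range(len(logits)), key=lambda i: logits[i])
    return list(embedding_table[token_id])

def latent_step(logits, embedding_table):
    """Latent step (ASSUMED form): feed back a probability-weighted mix of
    input embeddings instead of committing to a single token."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]

# Hypothetical 3-token vocabulary with 2-dimensional input embeddings.
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

next_input_hard = hard_step([0.1, 2.0, 0.3], EMBEDDINGS)    # one token's embedding
next_input_soft = latent_step([0.1, 2.0, 0.3], EMBEDDINGS)  # blend of all three
```

In this toy version, the hard step discards everything except the top token, while the latent step carries the whole distribution forward. When the model is nearly certain, the two inputs coincide.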

1
Latent exploration phase
The model projects its final-layer hidden states back into the input embedding space and feeds them directly into the next step. No decoding. No re-encoding. The key-value cache is preserved across this phase. The model explores in its own internal representation space without generating any visible text.
2
Switching decision
Two signals govern when to exit the latent phase: token entropy (a measure of how uncertain the model is about its next token) and a stop-token probe (whether the model's own stopping instinct is activating). When confidence crosses a threshold, the system transitions. The switch is model-driven, not fixed by a step count.
3
Explicit verification phase
Standard chain-of-thought resumes from where the latent phase ended, with the full KV cache intact. The model verifies its reasoning and generates the final answer in plain text. Because the exploration already narrowed the search space, this phase is shorter and more focused than an unguided chain-of-thought run.
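The switching decision in step 2 can be sketched as a small predicate over the next-token distribution. This is a minimal illustration, not the authors' implementation: how the two signals are combined (here, a simple OR), the default threshold values, and the function names are all assumptions made for the example.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_switch(probs, stop_token_id,
                  entropy_threshold=0.5, stop_threshold=0.3):
    """Decide whether to leave the latent phase.

    ASSUMPTION: switch when the model is confident (low entropy) OR when
    the stop token is becoming likely. The threshold values are
    placeholders, not values from the paper.
    """
    confident = token_entropy(probs) < entropy_threshold
    wants_to_stop = probs[stop_token_id] > stop_threshold
    return confident or wants_to_stop

# Hypothetical 4-token vocabulary; token 3 plays the role of the stop token.
still_exploring = should_switch([0.25, 0.25, 0.25, 0.25], stop_token_id=3)
ready_to_verify = should_switch([0.97, 0.01, 0.01, 0.01], stop_token_id=3)
```

A uniform distribution keeps the model in the latent phase, while a sharply peaked one triggers the handoff to explicit text. Because both thresholds are ordinary parameters, they can be calibrated per task.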
Training-free design

The base version of LaTER requires no changes to model weights. The projection from hidden states back to embedding space uses the existing model architecture. The entropy and stop-token probes read from the model's existing outputs. The entire method is a wrapper around standard inference: no new training, no new parameters.

A fine-tuned variant exists and performs better. But the no-finetuning result is the one that matters most for immediate production use: plug in, measure, decide.

Fewer tokens.
Better answers.

The headline result is the accuracy improvement alongside the cost reduction. Normally those move in opposite directions. On AIME 2025, they did not.

Token reduction: 32% (training-free, Qwen3-14B, AIME 2025)
AIME 2025 accuracy: 73.3% (up from 70.0% baseline)
Fine-tuned AIME 2025 accuracy: 80.0% (33% fewer tokens vs. SFT baseline)
Standard chain-of-thought
15,730 tokens. All visible. Every exploratory dead end is written out in full text. The model generates and then abandons reasoning paths that never needed to be decoded at all. Accuracy: 70.0%.
LaTER (training-free)
10,661 tokens. Early exploration happens in latent space: invisible and cheap. The model surfaces into explicit text only for verification. Accuracy: 73.3%. Fewer tokens and a better answer.
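As a quick check, the 32% headline figure follows directly from the two token counts quoted above. The variable names are just for this example.

```python
# Token counts and accuracies as quoted for Qwen3-14B on AIME 2025.
baseline_tokens = 15_730   # standard chain-of-thought
later_tokens = 10_661      # LaTER, training-free

token_reduction = (baseline_tokens - later_tokens) / baseline_tokens
accuracy_gain_points = round(73.3 - 70.0, 1)

print(f"{token_reduction:.1%}")   # 32.2%
print(accuracy_gain_points)       # 3.3
```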
Finding 1: The token savings do not come from reasoning less

The natural assumption is that latent exploration simply trims the exploratory tails from chains of thought and leaves the core reasoning intact. The accuracy improvement suggests something more interesting: the latent phase may be doing exploration that is genuinely productive. The model arrives at the explicit verification phase already oriented, and the verification is tighter as a result. Verbose chains include noise, not just signal.

Finding 2: Across benchmarks, reductions range 16–32%

AIME 2025 is the strongest result, but the method holds across other benchmarks at 16–32% token reduction with matched or improved accuracy. The range reflects how much of a given chain-of-thought is exploratory versus verification-focused, which varies by problem type. The entropy and stop-token thresholds can be tuned per task to target a specific tradeoff.

Finding 3: Fine-tuning pushes further

The fine-tuned variant reaches 80.0% on AIME 2025 with 33% fewer tokens than the standard supervised fine-tuning baseline. This suggests the model can learn to use the latent phase more deliberately when given training signal. The training-free version is the floor, not the ceiling.

Scope and limitations

Results are reported on Qwen3-14B. Generalization across model families, sizes, and architectures requires further testing. The latent phase relies on the model reasoning effectively in its own representation space, which may behave differently across architectures. The switching thresholds need per-task calibration, which adds implementation overhead that a simple token-budget approach does not have.

What this means
for builders.

The token savings are real and immediate. The accuracy improvement is the more important signal: it suggests that verbose chain-of-thought includes noise the model does not need, and that noise can be filtered at the source.

1
For inference teams and production deployments
A 16–32% token reduction on hard reasoning tasks is direct cost savings at scale. The training-free version requires no model changes. Profile your current token counts on target tasks, apply LaTER, and measure. The method performs unevenly across task types, so benchmark before committing to a production configuration.
2
For system architects
LaTER is a pattern, not just a paper. The core idea of doing coarse exploration in a cheaper representation before switching to precise, expensive generation has broader application. Any system where reasoning involves extensive trial-and-error before a committed output is a candidate for this kind of two-phase split.
3
For evaluators and benchmark designers
The accuracy improvement alongside token reduction is the result that challenges a common assumption. If the verbose version scores lower than the quieter version, then token count is not a reliable proxy for reasoning effort. Benchmarks that reward longer chains of thought may be measuring verbosity as much as quality.
4
A note on threshold tuning
The entropy and stop-token thresholds that control the latent-to-explicit switch are configurable but require calibration. The paper includes ablations on this. For production use, plan for a tuning step per task type rather than assuming a single setting generalizes. This is the main operational cost of LaTER compared to a fixed-budget approach.
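The per-task tuning step could look something like the sketch below: run a small sweep on a calibration set, then keep the cheapest setting that does not fall below your baseline accuracy. This selection rule is an assumption for illustration, not a procedure from the paper, and every number in the sweep is a made-up placeholder rather than a reported result.

```python
from typing import List, NamedTuple, Optional

class SweepResult(NamedTuple):
    entropy_threshold: float
    stop_threshold: float
    accuracy: float      # fraction correct on your calibration set
    mean_tokens: float   # mean visible tokens per problem

def pick_thresholds(results: List[SweepResult],
                    min_accuracy: float) -> Optional[SweepResult]:
    """Return the lowest-token setting that meets the accuracy floor,
    or None if no setting qualifies."""
    eligible = [r for r in results if r.accuracy >= min_accuracy]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r.mean_tokens)

# PLACEHOLDER sweep values, invented for this example only.
sweep = [
    SweepResult(0.3, 0.2, accuracy=0.72, mean_tokens=12_000.0),
    SweepResult(0.5, 0.3, accuracy=0.70, mean_tokens=10_500.0),
    SweepResult(0.8, 0.5, accuracy=0.64, mean_tokens=9_000.0),
]
chosen = pick_thresholds(sweep, min_accuracy=0.70)
```

With these placeholder values, the third setting is the cheapest but misses the accuracy floor, so the second is chosen. Keeping one such sweep per task type makes the calibration cost explicit and repeatable.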

Where to go
from here.

If you want to go deeper.

1
Read the paper
Li, X., Wang, Y., Liu, Y., Liu, G., Qiu, D., Liu, S., Liang, J., Huang, W., Yu, J., & Zhu, J. (2026). LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification. arXiv:2605.07315.
2
Find the code release
The authors state that code, data, and model are publicly available. Check the paper for the GitHub link at time of reading. The training-free version runs on top of standard Qwen3-14B inference without any weight changes.
3
Run the training-free version first
No fine-tuning required. Start there. Profile token usage and accuracy on your target benchmarks before investing in a fine-tuned variant. The base results on Qwen3-14B are already strong enough to evaluate whether the approach fits your task.
4
Benchmark your own tasks
Token reductions range from 16% to 32% across benchmarks. Profile your specific task type rather than assuming the AIME result generalizes directly. Tasks with heavier verification requirements relative to exploration may see smaller latent-phase benefit.
5
Review the switching-threshold ablations
The paper includes ablation studies on how entropy and stop-token thresholds affect the latent-to-explicit transition. Read this section before setting production thresholds, since optimal values vary by task type and sampling temperature.
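For the profiling steps above, a comparison harness can be very small. The sketch below assumes you have already logged, for each problem, how many tokens a run used and whether the answer was correct. The record format and function names are hypothetical, and the sample runs are invented.

```python
from statistics import mean

def summarize(runs):
    """runs: list of (tokens_used, is_correct) pairs for one configuration."""
    return {
        "mean_tokens": mean(tokens for tokens, _ in runs),
        "accuracy": mean(1.0 if ok else 0.0 for _, ok in runs),
    }

def compare(baseline_runs, later_runs):
    """Token reduction (as a fraction) and accuracy change versus baseline."""
    base, later = summarize(baseline_runs), summarize(later_runs)
    return {
        "token_reduction": 1.0 - later["mean_tokens"] / base["mean_tokens"],
        "accuracy_delta": later["accuracy"] - base["accuracy"],
    }

# Invented sample logs: two problems per configuration.
baseline_runs = [(1000, True), (3000, False)]
later_runs = [(800, True), (2200, True)]
report = compare(baseline_runs, later_runs)
```

Running the same comparison per task type, rather than on one pooled set, shows where the latent phase helps and where it does not.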