Not every token
earns the same budget.
Every token in an LLM's output gets identical compute, whether it is the word "the" or a novel mathematical reasoning step. Cornell researchers built a small policy network that watches what the model is thinking at each step and decides how hard that step actually needs to run. The result: up to 7.3 more points on MMLU for the same number of FLOPs.
Uniform compression
ignores what the model knows.
Every standard inference optimization technique (quantization, pruning, attention sparsity) picks a single budget and applies it identically to every token in every generation. But the model already knows which steps are easy. It is just never asked.
When you compress a language model for production serving, you make a single global decision: run at 4-bit, prune 30% of attention heads, skip some MLP activations. That setting applies to every decode step, whether the model is generating a filler word with near-certain probability or working out which argument in a multi-step legal brief is most relevant.
This is wasteful by design. Token difficulty is not uniform. Common words and predictable continuations could be generated with a fraction of the compute. Genuinely hard tokens, where the model's next-token probability distribution is spread across many candidates, warrant the full budget. A single global setting has no way to tell the two apart.
There has been work on dynamic compute at inference time, but most of it operates at the query level: decide once whether to route the whole request to a lighter model, or terminate a chain-of-thought early. SOL asks a finer question: within a single generation, what should the budget be for each individual step?
Can a small policy network learn to read the LLM's own hidden state at each decode step and select the right compute-efficiency action for that specific token, without retraining the base model and without a human specifying the schedule?
A policy that watches
the model think.
SOL is not a new model architecture. It is a small learned scheduler layered on top of a frozen LLM. The LLM generates as it always did. The policy decides, step by step, how hard each generation step should run.
The setup has two components. The first is a frozen base LLM, unchanged from its original weights. The second is a lightweight policy network, small enough that its own overhead does not eat the savings it produces.
At each decode step, before the LLM computes its output, the policy reads the model's current hidden state. That hidden state is already a compressed summary of everything the model knows about the context so far and what it expects to generate next. The policy uses that signal to select one of several discrete efficiency actions: how sparse to make attention, whether and how much to prune MLP intermediate layers, and which quantization bit-width to use for that step.
The action is applied to that step only. The next step gets its own fresh decision. The policy is never handed a fixed per-token schedule; it learns to trade quality against a per-generation budget target, which is supplied as a single scalar parameter at inference time.
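The per-step decision loop can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the action menu, the `LinearPolicy` class, the `schedule` helper, and the choice to append the budget scalar to the hidden state are all assumptions made for the example, and the weights are random rather than trained.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One discrete efficiency setting for a single decode step (hypothetical menu)."""
    attn_keep: float  # fraction of attention entries kept
    mlp_keep: float   # fraction of MLP intermediate units kept
    bits: int         # quantization bit-width for this step


# Hypothetical action menu, ordered from cheapest to full budget.
ACTIONS = [
    Action(attn_keep=0.25, mlp_keep=0.5, bits=4),
    Action(attn_keep=0.5, mlp_keep=0.75, bits=8),
    Action(attn_keep=1.0, mlp_keep=1.0, bits=16),
]


class LinearPolicy:
    """Tiny linear policy (an assumption): logits = W @ [hidden, budget] + b."""

    def __init__(self, hidden_dim, n_actions, seed=0):
        rng = random.Random(seed)
        # One weight row per action; the extra column is for the budget scalar.
        self.w = [[rng.gauss(0, 0.1) for _ in range(hidden_dim + 1)]
                  for _ in range(n_actions)]
        self.b = [0.0] * n_actions

    def probs(self, hidden, budget):
        x = list(hidden) + [budget]
        logits = [sum(wi * xi for wi, xi in zip(row, x)) + bias
                  for row, bias in zip(self.w, self.b)]
        m = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(v - m) for v in logits]
        z = sum(exps)
        return [e / z for e in exps]

    def select(self, hidden, budget):
        # Greedy choice: index of the most probable action.
        p = self.probs(hidden, budget)
        return max(range(len(p)), key=p.__getitem__)


def schedule(hidden_states, policy, budget):
    """Pick a fresh action for every decode step; nothing carries over between steps."""
    return [ACTIONS[policy.select(h, budget)] for h in hidden_states]
```

In a real system the hidden states would come from the frozen LLM at each decode step; here any list of equal-length float vectors will do, for example `schedule(states, LinearPolicy(hidden_dim=4, n_actions=len(ACTIONS)), budget=0.5)`.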
The policy is trained via GRPO (Group Relative Policy Optimization) on teacher-forced episodes. The reward balances quality against adherence to the budget target. Because training runs on teacher-forced data, the policy learns from the same text it would see at inference without needing separate costly rollouts. The result is a policy trained end-to-end to exploit whatever signal the hidden state carries about per-token difficulty.
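To make the training signal concrete, here is a rough sketch of the group-relative step that gives GRPO its name: several candidate schedules are scored on the same episode, and each reward is standardized against the group. The `reward` function, its `penalty` weight, and the sample numbers are hypothetical; the article only says that the reward balances quality against adherence to the budget target.

```python
import statistics


def reward(quality, cost, budget, penalty=1.0):
    """Hypothetical reward: quality minus a penalty for missing the budget target."""
    return quality - penalty * abs(cost - budget)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four made-up candidate schedules for one teacher-forced episode:
# (quality, relative cost). Quality could be, for instance, the mean
# log-probability of the reference tokens under the compressed model.
samples = [(-1.2, 0.9), (-1.5, 0.5), (-2.4, 0.3), (-1.3, 0.6)]
budget = 0.5

rewards = [reward(q, c, budget) for q, c in samples]
advantages = group_relative_advantages(rewards)
best = max(range(len(advantages)), key=advantages.__getitem__)
```

With these numbers the fourth schedule receives the largest advantage: it gives up little quality while landing close to the budget. The first schedule has the best raw quality but is penalized for overshooting the target.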
7.3 MMLU points
at the same FLOPs.
Matched-compute comparisons are the honest test for efficiency work. SOL was compared against uniform-budget compression at equivalent FLOPs, not at equivalent quality. The Pareto front tells the full story.
At the same FLOPs budget, SOL scored up to 7.3 points higher on MMLU than uniform-budget compression baselines. The quality-efficiency Pareto front improved consistently across the experiments: at every tested compute level, SOL outperformed static compression methods. This is the meaningful comparison because it holds compute constant and shows what better scheduling of that same compute achieves.
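A matched-compute comparison means the two schedules spend the same total compute and differ only in where they spend it. The toy check below illustrates the idea. The per-step costs are invented relative numbers (1.0 = full budget), not measurements from the paper, and `mean_cost` and `is_matched` are names made up for this example.

```python
def mean_cost(step_costs):
    """Average relative compute per decode step (1.0 = uncompressed step)."""
    return sum(step_costs) / len(step_costs)


def is_matched(a, b, tol=1e-9):
    """True when two schedules spend the same average compute per step."""
    return abs(mean_cost(a) - mean_cost(b)) <= tol


# Uniform compression: every step runs at half cost.
uniform = [0.5] * 8

# Per-token schedule: cheap on easy steps, full budget on two hard ones.
per_token = [0.25, 0.25, 1.0, 0.25, 0.25, 1.0, 0.25, 0.75]

print(mean_cost(uniform), mean_cost(per_token), is_matched(uniform, per_token))
# prints: 0.5 0.5 True
```

Both schedules average half the full cost per step, so any quality difference between them comes from how the compute is allocated, not from how much is spent.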
A single scalar parameter controls the operating point at inference time. Moving that parameter shifts SOL along the Pareto front without retraining. This means a deployed system can be tuned for different latency targets on the fly.
The hidden state at each decode step encodes what the model is "thinking" about the current context and likely continuation. Low-entropy steps, where the next token is nearly certain, have different hidden state patterns than high-entropy steps, where the model is genuinely uncertain. The policy learns to distinguish these from the hidden state alone, without being told explicitly which tokens are hard. It discovers the signal through training rewards.
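For readers who want the entropy distinction pinned down, the short example below computes the Shannon entropy of a next-token distribution from raw logits. It only illustrates what "low-entropy" and "high-entropy" steps mean. As described above, the policy reads the hidden state rather than this quantity, and the four-token vocabulary and logit values are invented.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy_bits(logits):
    """Shannon entropy, in bits, of the distribution implied by the logits."""
    return -sum(p * math.log2(p) for p in softmax(logits) if p > 0)


# Toy four-token vocabulary.
confident = [10.0, 0.0, 0.0, 0.0]  # one token dominates: an "easy" step
uncertain = [1.0, 1.0, 1.0, 1.0]   # all tokens equally likely: a "hard" step

low = entropy_bits(confident)
high = entropy_bits(uncertain)
```

The uniform case comes out at 2 bits, the maximum for four outcomes, while the confident case is close to zero.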
SOL operates at a finer granularity than most inference-time efficiency research. LaTER terminates chain-of-thought early at the reasoning-trace level. BoundaryRouter decides at query time whether to use a full agent or a lighter path. Dual-Dimensional Consistency prunes a self-consistency vote tree. All three operate on coarser scheduling units than a single decode step. SOL is complementary to those approaches: it modulates compute inside a single forward pass, token by token. Layering all of them yields a coherent efficiency stack.
What this means
for production inference.
Per-token compute scheduling is not a research curiosity. Any team running high-volume LLM inference at a fixed FLOP budget has a reason to care about where those FLOPs are actually going.
Where to go
from here.
The paper is self-contained and the approach is well-scoped. These are the most direct next steps if you want to go deeper.
First surfaced in Tandemly Briefing — 2026-05-26.