Inference Optimization · Efficient LLMs

Not every token
earns the same budget.

Under standard decoding, every token in an LLM's output gets identical compute, whether it is the word "the" or a novel mathematical reasoning step. Cornell researchers built a small policy network that reads the model's hidden state at each step and decides how much compute that step actually needs. The result: up to 7.3 more MMLU points than uniform compression at the same number of FLOPs.

Core concept
Self-Optimizing Language Models (SOL): a frozen LLM paired with a tiny policy that reads the model's own hidden state at each decode step and picks a discrete efficiency action — attention sparsity, MLP pruning, or quantization bit-width — before that step runs.

Uniform compression
ignores what the model knows.

Every standard inference optimization technique -- quantization, pruning, attention sparsity -- picks a single budget and applies it identically to every token in every generation. But the model already knows which steps are easy. It is just never asked.

When you compress a language model for production serving, you make a single global decision: run at 4-bit, prune 30% of attention heads, skip some MLP activations. That setting applies to every decode step, whether the model is generating a filler word with near-certain probability or working out which argument in a multi-step legal brief is most relevant.

This is wasteful by design. Token difficulty is not uniform. Common words and predictable continuations could be generated with a fraction of the compute. Genuinely hard tokens, where the model's next-token distribution is uncertain, warrant the full budget. A single global setting has no way to tell the difference.

There has been work on dynamic compute at inference time, but most of it operates at the query level: decide once whether to route the whole request to a lighter model, or terminate a chain-of-thought early. SOL asks a finer question: within a single generation, what should the budget be for each individual step?

The question this paper asks

Can a small policy network learn to read the LLM's own hidden state at each decode step and select the right compute-efficiency action for that specific token, without retraining the base model and without a human specifying the schedule?

A policy that watches
the model think.

SOL is not a new model architecture. It is a small learned scheduler layered on top of a frozen LLM. The LLM generates as it always did. The policy decides, step by step, how hard each generation step should run.

The setup has two components. The first is a frozen base LLM, unchanged from its original weights. The second is a lightweight policy network, small enough that its own overhead does not eat the savings it produces.

At each decode step, before the LLM computes its output, the policy reads the model's current hidden state. That hidden state is already a compressed summary of everything the model knows about the context so far and what it expects to generate next. The policy uses that signal to select one of several discrete efficiency actions: how sparse to make attention, whether and how much to prune MLP intermediate layers, and which quantization bit-width to use for that step.

The action is applied to that step only. The next step gets its own fresh decision. The policy is never handed a fixed per-token schedule; it learns to trade quality against a per-generation budget target, which is set by a single scalar parameter at inference time.
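The per-step decision described above can be sketched roughly as follows. This is a minimal illustration, not the paper's policy architecture: the linear policy, the action values in the grid, and the choice to feed the budget scalar in as an extra input feature are all assumptions made for the example, and the hidden states are random stand-ins for what a frozen LLM would provide.

```python
import itertools
import numpy as np

# Hypothetical action grid: every combination of the three efficiency axes.
ATTN_SPARSITY = [0.0, 0.5, 0.75]  # fraction of attention skipped (assumed values)
MLP_PRUNE = [0.0, 0.25, 0.5]      # fraction of MLP intermediate units pruned (assumed)
BIT_WIDTH = [16, 8, 4]            # quantization bit-width (assumed values)
ACTIONS = list(itertools.product(ATTN_SPARSITY, MLP_PRUNE, BIT_WIDTH))


class TinyPolicy:
    """Linear policy: [hidden_state, budget] -> probabilities over discrete actions."""

    def __init__(self, hidden_dim, rng):
        # One extra input column carries the scalar budget target (an assumption).
        self.w = rng.normal(scale=0.02, size=(len(ACTIONS), hidden_dim + 1))
        self.b = np.zeros(len(ACTIONS))

    def action_probs(self, hidden_state, budget):
        x = np.concatenate([hidden_state, [budget]])
        logits = self.w @ x + self.b
        logits -= logits.max()  # stabilise the softmax
        p = np.exp(logits)
        return p / p.sum()

    def select(self, hidden_state, budget):
        """Greedy choice for one decode step; returns (attn_sparsity, mlp_prune, bits)."""
        probs = self.action_probs(hidden_state, budget)
        return ACTIONS[int(np.argmax(probs))]


rng = np.random.default_rng(0)
policy = TinyPolicy(hidden_dim=64, rng=rng)

# Stand-in hidden states; a real system would read them from the frozen LLM.
hidden_states = rng.normal(size=(5, 64))

# Each decode step gets its own independent decision.
schedule = [policy.select(h, budget=0.5) for h in hidden_states]
```

With untrained weights the chosen actions are arbitrary; the point is only the shape of the loop, in which one cheap policy call precedes each decode step and its output applies to that step alone.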

Prior approach: uniform compression
One budget for every token. Quantize the whole model to 4-bit. Prune 30% of attention heads. Set attention sparsity to a fixed pattern. Every decode step runs under identical constraints, regardless of whether the step is generating "the" or completing a reasoning chain.

SOL: learned per-token policy
Budget matched to difficulty. A tiny policy reads the LLM's own hidden state at the start of each decode step and picks the action for that step. Easy tokens get lighter treatment. Hard tokens get more. The base model is frozen; only the policy is trained.

Training the policy

The policy is trained via GRPO (Group Relative Policy Optimization) on teacher-forced episodes. The reward balances quality against adherence to the budget target. Because the episodes are teacher-forced, each step is conditioned on a known reference prefix, so the policy can learn from existing text without separate costly rollouts. The result is a policy trained end-to-end to exploit whatever signal the hidden state carries about per-token difficulty.
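To make the training signal concrete, here is a small sketch of the group-relative step that gives GRPO its name: several action schedules are sampled for the same episode, each is scored, and rewards are standardised within the group. The reward shape (quality minus an absolute penalty for missing the budget target), the `penalty` weight, and the toy numbers are assumptions for illustration; the paper's exact reward is not reproduced here.

```python
import numpy as np


def reward(quality, flops_used, budget_target, penalty=1.0):
    """Hypothetical reward: quality minus a penalty for missing the budget target."""
    return quality - penalty * abs(flops_used - budget_target)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardise rewards within one sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Four action schedules sampled for the same teacher-forced episode.
# quality: e.g. mean log-likelihood of the reference tokens (assumed metric).
# flops_used: fraction of full-compute FLOPs each schedule spent.
quality = [-1.20, -1.05, -1.60, -1.10]
flops_used = [0.50, 0.80, 0.30, 0.55]

rewards = [reward(q, f, budget_target=0.5) for q, f in zip(quality, flops_used)]
advantages = group_relative_advantages(rewards)

# Index of the schedule that GRPO would reinforce most strongly.
best = int(np.argmax(advantages))
```

Here the fourth schedule wins: it gives up a little quality relative to the second but lands much closer to the budget target, which is the kind of trade-off the reward is meant to encode.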

7.3 MMLU points
at the same FLOPs.

Matched-compute comparisons are the honest test for efficiency work. SOL was compared against uniform-budget compression at equivalent FLOPs, not at equivalent quality. The Pareto front tells the full story.

+7.3 MMLU points over uniform compression at matched FLOPs (peak result)
1 scalar parameter to set the operating point without retraining
3 jointly tuned efficiency axes: attention sparsity, MLP pruning, quantization bit-width

The key result

At the same FLOPs budget, SOL scored up to 7.3 points higher on MMLU than uniform-budget compression baselines. The quality-efficiency Pareto front improved consistently across the experiments: at every tested compute level, SOL outperformed static compression methods. This is the meaningful comparison because it holds compute constant and isolates what better scheduling of that same compute achieves.

A single scalar parameter controls the operating point at inference time. Moving that parameter shifts SOL along the Pareto front without retraining. This means a deployed system can be tuned for different latency targets on the fly.
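For readers who want to build this kind of comparison themselves, the sketch below extracts a Pareto front from a set of (FLOPs, accuracy) measurements, such as those gathered by sweeping an operating-point parameter. The function name and the numbers are made up for illustration and are not results from the paper.

```python
def pareto_front(points):
    """Return the (flops, accuracy) points not dominated by any other point.

    A point is dominated when another point uses no more FLOPs and reaches
    at least the same accuracy, and is strictly better on one of the two.
    """
    front = []
    best_acc = float("-inf")
    # Sort by FLOPs ascending; on ties, consider the higher accuracy first.
    for flops, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:
            front.append((flops, acc))
            best_acc = acc
    return front


# Illustrative measurements only: (relative FLOPs, accuracy in percent).
measurements = [(1.0, 60.0), (2.0, 58.0), (2.0, 65.0), (3.0, 64.0), (4.0, 70.0)]
front = pareto_front(measurements)
```

Running the same function on the points from a static baseline and from a per-token scheduler gives two curves that can be compared at every compute level rather than at a single coordinate.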

Why the hidden state works as a scheduling signal

The hidden state at each decode step encodes what the model is "thinking" about the current context and likely continuation. Low-entropy steps, where the next token is nearly certain, have different hidden state patterns than high-entropy steps, where the model is genuinely uncertain. The policy learns to distinguish these from the hidden state alone, without being told explicitly which tokens are hard. It discovers the signal through training rewards.

How SOL fits with other efficiency work

SOL operates at a finer granularity than most inference-time efficiency research. LaTER terminates chain-of-thought early at the reasoning-trace level. BoundaryRouter decides at query time whether to use a full agent or a lighter path. Dual-Dimensional Consistency prunes a self-consistency vote tree. All three operate on coarser scheduling units than a single decode step. SOL is orthogonal to those approaches: it modulates the compute spent on each decode step, token by token. Layering all of them yields a coherent efficiency stack.

What this means
for production inference.

Per-token compute scheduling is not a research curiosity. Any team running high-volume LLM inference at a fixed FLOP budget has a reason to care about where those FLOPs are actually going.

1
For builders running inference at scale
Uniform compression leaves quality on the table because it cannot distinguish token difficulty. Before committing to a global quantization or pruning setting, profile what fraction of your model's decode steps are genuinely high-entropy and how many are near-certain predictions. If the high-entropy fraction is modest, you are over-spending compute on most of your tokens.
2
The hidden state is an underused signal
SOL demonstrates that the model's own hidden state carries actionable information about how hard a step is. That signal is available at inference time with no additional input. Any efficiency policy, not just SOL's, could potentially exploit it. If you are building custom inference serving infrastructure, treat the hidden state as a first-class scheduling input.
3
For ML researchers and engineers
The GRPO setup, trained on teacher-forced episodes, is worth studying for its own sake. It provides a clean way to train a policy over a frozen model's behavior using a reward that balances quality against a budget target. That pattern generalizes beyond compute scheduling to any case where you want to train a lightweight meta-controller over a large frozen system.
4
For business leaders evaluating AI serving costs
The single-scalar operating-point control means the trade-off between quality and cost can be adjusted after deployment without retraining. That is a meaningful operational property: the same deployed system can be shifted toward higher quality when margins allow or toward lower cost during demand spikes, without a new training run.

Where to go
from here.

The paper is self-contained and the approach is well-scoped. These are the most direct next steps if you want to go deeper.

1
Read the paper
Akhauri, Y. & Abdelfattah, M. S. (2026). Compute Where it Counts: Self Optimizing Language Models. Cornell University. arXiv:2605.10875. The GRPO training setup and the policy architecture details are in the methods section and are straightforward to follow.
2
Profile your own per-token difficulty distribution
Before applying any per-token scheduling, measure how output uncertainty is distributed across your actual production workload. Log the top-1 probability (and, if you can, the entropy of the next-token distribution) at each decode step over a representative sample. If, say, 80% of tokens have near-certain predictions, you have a large pool of steps where lighter treatment would cost little in quality.
3
Study the GRPO training setup using TRL
The TRL library (Hugging Face) has direct support for GRPO. The teacher-forced episode construction and the quality-budget reward function are the pieces most worth replicating if you want to adapt the approach to a different base model or a different set of efficiency actions.
4
Evaluate your compression baselines on a Pareto front
Single-point comparisons (this method at X FLOPs achieves Y accuracy) are easier to misread than a Pareto curve. When benchmarking any compression or efficiency approach, plot quality against compute across multiple operating points. SOL's advantage shows up most clearly in the shape of the curve, not at a single coordinate.
5
Consider SOL alongside trace-level and query-level scheduling
SOL is compatible with LaTER (early exit from reasoning traces), BoundaryRouter (query-level routing to lightweight inference), and Dual-Dimensional Consistency (self-consistency tree pruning). Each operates at a different granularity of the generation process. A full efficiency stack would combine all four rather than pick one.
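The profiling suggestion in step 2 can be sketched in a few lines. This is a minimal example assuming you can capture the per-step logits from your serving stack as an array of shape (steps, vocab); the function name and the 0.9 cutoff for "near-certain" are illustrative choices, not values from the paper.

```python
import numpy as np


def profile_decode_steps(logits, certain_threshold=0.9):
    """Summarise per-step difficulty from logits of shape (steps, vocab)."""
    logits = np.asarray(logits, dtype=float)
    # Softmax with max-subtraction for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)

    top1 = probs.max(axis=-1)
    # Shannon entropy in nats; the clip avoids log(0).
    entropy = -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=-1)

    return {
        "top1": top1,
        "entropy": entropy,
        "near_certain_fraction": float((top1 >= certain_threshold).mean()),
    }


# Toy logits for three decode steps over a four-token vocabulary.
toy_logits = [
    [10.0, 0.0, 0.0, 0.0],  # near-certain step
    [0.0, 0.0, 0.0, 0.0],   # maximally uncertain step
    [8.0, 0.0, 0.0, 0.0],   # near-certain step
]
stats = profile_decode_steps(toy_logits)
```

On real traffic, `near_certain_fraction` gives a rough upper bound on how many steps might tolerate lighter treatment; how much quality that treatment actually costs still has to be measured.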
Attribution

First surfaced in Tandemly Briefing — 2026-05-26.