Inference Optimization · Reasoning

Not every reasoning path
deserves equal weight.

Self-consistency sampling generates multiple reasoning chains and takes a majority vote. Every trace counts the same, whether it reasoned carefully or hallucinated confidently, and every trace runs to completion even when it is going nowhere. Researchers at Xi'an Jiaotong University built DDC to fix both problems: weight votes by path quality, and prune weak traces before they finish.

Core concept

Dual-Dimensional Consistency (DDC): Confidence-Weighted Bayesian Voting adjusts how much each answer counts based on the quality of the trace that produced it. Trend-Aware Stratified Pruning tracks whether a reasoning trace is heading toward a good answer and cuts it early if not.

First surfaced in Tandemly Briefing — 2026-05-19.

01 The problem

Every path gets one vote.
Even the bad ones.

Self-consistency is reliable and widely used. Its weakness is that it treats every reasoning chain as equally trustworthy, regardless of how well that chain actually reasoned.

Self-consistency is one of the more reliable techniques for improving LLM accuracy on hard reasoning tasks. The idea is simple: generate N independent reasoning chains for a single query, then take the majority vote on the final answers. If most chains arrive at the same answer, that answer is probably right.

The problem is what the vote ignores. Some reasoning chains proceed carefully through a difficult problem and arrive at the right answer via sound logic. Others follow a plausible-sounding path and arrive at a wrong answer with the same apparent confidence. Standard self-consistency counts both the same. A hallucinated answer backed by several traces can beat a correct answer backed by fewer, regardless of how well each set of traces actually reasoned.
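
To make the failure mode concrete, here is a minimal sketch of plurality voting. The answers are made up for illustration: three traces agree on a wrong answer and two reach the right one, and the vote has no way to tell the difference.

```python
from collections import Counter

def plurality_vote(answers):
    """Return the most common final answer; every trace counts exactly once."""
    counts = Counter(answers)
    winner, _ = counts.most_common(1)[0]
    return winner

# Hypothetical final answers from five traces.
# Suppose "17" is correct and "42" is a shared hallucination.
answers = ["42", "42", "42", "17", "17"]
print(plurality_vote(answers))  # prints 42
```

Nothing about how each trace reasoned enters the count, which is the gap the rest of the article is about.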

The second problem is one of depth, meaning how long each chain runs. Once a reasoning chain has started, current systems typically run it to completion before deciding whether it was worth running at all. The decision to extend or abandon a path is made after all the compute has already been spent. There is no mechanism to detect, mid-trace, that a chain has gone off the rails.

The question this paper asks

Can you build a unified framework that handles both problems at once: adjusting how much weight each path gets based on its quality, and terminating low-quality paths early before they burn unnecessary compute?

02 The experiment

Weight the vote.
Prune the path.

DDC addresses two dimensions simultaneously: how answers are aggregated across paths (width), and how long each path runs (depth). The two components can work independently but are designed to reinforce each other.

Standard self-consistency

Plurality vote, fixed N. Generate N traces, count the most common answer. Each trace gets one vote. All traces run to completion. Easy and hard queries consume the same compute.

DDC

Confidence-weighted, adaptive termination. Traces are scored by quality. Votes are weighted by confidence. Weak paths are pruned mid-trace. Sampling stops early when consensus is strong enough and confident enough.

The first component is Confidence-Weighted Bayesian Voting, or CWBT. Standard self-consistency runs a vote where each trace has an equal say. CWBT replaces that with a Bayesian aggregation that weighs each answer by a confidence score derived from the trace that produced it. Traces that reasoned soundly get more influence over the final answer. Traces that wandered or contradicted themselves carry less weight.
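
A rough sketch of the idea, not the paper's formulation: the article does not reproduce the Bayesian aggregation itself, so the version below simply sums a per-trace confidence score for each answer and normalizes. The function name `weighted_vote` and the confidence values are hypothetical, and how a confidence score is derived from a trace is left as an assumption.

```python
from collections import defaultdict

def weighted_vote(traces):
    """traces: list of (answer, confidence) pairs, confidence assumed in (0, 1]."""
    scores = defaultdict(float)
    for answer, confidence in traces:
        scores[answer] += confidence  # each vote counts in proportion to confidence
    total = sum(scores.values())
    # Normalize so each answer's score reads as its share of total confidence.
    shares = {answer: score / total for answer, score in scores.items()}
    winner = max(shares, key=shares.get)
    return winner, shares

# Hypothetical traces: three low-confidence votes for "42", two high-confidence for "17".
traces = [("42", 0.30), ("42", 0.35), ("42", 0.25), ("17", 0.90), ("17", 0.85)]
winner, shares = weighted_vote(traces)
print(winner)  # prints 17
```

With these made-up scores the two confident traces outweigh the three hesitant ones, so the result flips relative to a plain head count.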

The system also uses this confidence information to decide when to stop sampling. Rather than always generating a fixed N traces, CWBT terminates early when two conditions are met: the answers across traces have converged sufficiently, and the confidence in the converging answer is high enough. Easy queries that all traces answer the same way stop early. Hard queries where traces diverge continue toward N.
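
The two stopping conditions can be sketched as a sequential sampling loop. Everything here is an illustrative assumption: the thresholds, the minimum sample count, the `sample_trace` callback, and the use of confidence share as the convergence measure are not taken from the paper.

```python
from itertools import cycle

def leader_stats(traces):
    """Return (leading answer, its share of total confidence, its mean confidence)."""
    totals, counts = {}, {}
    for answer, conf in traces:
        totals[answer] = totals.get(answer, 0.0) + conf
        counts[answer] = counts.get(answer, 0) + 1
    leader = max(totals, key=totals.get)
    share = totals[leader] / sum(totals.values())
    return leader, share, totals[leader] / counts[leader]

def sample_until_confident(sample_trace, max_n=16, min_n=3,
                           share_threshold=0.8, conf_threshold=0.7):
    """Draw traces one at a time (max_n >= 1); stop once both conditions hold."""
    traces = []
    while len(traces) < max_n:
        traces.append(sample_trace())
        if len(traces) >= min_n:
            _, share, mean_conf = leader_stats(traces)
            # Condition 1: answers have converged. Condition 2: the leader is confident.
            if share >= share_threshold and mean_conf >= conf_threshold:
                break
    return leader_stats(traces)[0], len(traces)

# Hypothetical samplers: an easy query always agrees, a hard one keeps splitting.
easy = cycle([("7", 0.9)])
hard = cycle([("1", 0.6), ("2", 0.6)])
print(sample_until_confident(lambda: next(easy)))     # ('7', 3)
print(sample_until_confident(lambda: next(hard))[1])  # 16
```

The easy query stops at the minimum of three traces, while the split query never meets either threshold and runs to the full budget.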

The second component is Trend-Aware Stratified Pruning, or TASP. This operates during generation of each individual trace. TASP treats the emerging reasoning text as a signal over time, similar to how a moving average separates trend from noise in a data series. It identifies the underlying direction of a trace and compares it to the high-quality peer traces being generated. When a trace's trajectory diverges clearly from its well-performing peers, TASP prunes it before it completes, recovering the compute for the traces still worth running.

What "trend-aware" means here

A reasoning trace does not proceed in a straight line. Some start promisingly and veer off midway. Others recover from a shaky beginning. TASP tracks trajectory, not just current state. A trace heading toward a wrong answer can be caught early even if the most recent tokens still look reasonable in isolation. This is the key difference from cutoff-based pruning, which just stops traces at a fixed length.
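
One way to picture trajectory-based pruning, as a sketch only: assume each trace emits a quality score per reasoning step (a hypothetical signal; the article does not say what TASP measures), smooth it with an exponential moving average, and prune when the smoothed trend is falling and sits well below the peers. The stratification part of TASP is not modeled here, and `margin`, `window`, and `alpha` are illustrative values.

```python
from statistics import median

def ema(values, alpha=0.3):
    """Exponential moving average: damps step-level noise to expose the trend."""
    smoothed = values[0]
    for v in values[1:]:
        smoothed = alpha * v + (1 - alpha) * smoothed
    return smoothed

def should_prune(trace_scores, peer_scores, margin=0.15, window=3):
    """Prune when the trace's trend is falling and is well below its peers' level."""
    if len(trace_scores) < 2 * window:
        return False  # too little history to estimate a trend
    recent = ema(trace_scores[-window:])
    earlier = ema(trace_scores[-2 * window:-window])
    falling = recent < earlier
    peer_level = median([ema(p) for p in peer_scores])
    return falling and recent < peer_level - margin

# Hypothetical per-step scores.
peers = [[0.70, 0.75, 0.80, 0.80, 0.85, 0.85],
         [0.70, 0.70, 0.75, 0.80, 0.80, 0.80]]
veering = [0.80, 0.80, 0.75, 0.60, 0.50, 0.40]     # starts well, drifts off
recovering = [0.40, 0.45, 0.50, 0.60, 0.70, 0.80]  # shaky start, improving

print(should_prune(veering, peers))     # True
print(should_prune(recovering, peers))  # False
```

The recovering trace is kept even though its early steps scored lower than anything in the veering trace, which is the distinction between tracking trajectory and checking a single snapshot.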

03 Findings

On five benchmarks,
fewer tokens, matched accuracy.

The researchers evaluated DDC across five reasoning benchmarks using open-weight LLMs. Token reduction varied by configuration, but accuracy held or improved across the conditions tested.

AIME 2025 accuracy gain

+15.6%

over standard self-consistency, Qwen3-4B

Token reduction (strong baseline)

~27x

in favorable configurations

Benchmarks evaluated

5

reasoning benchmarks, open-weight LLMs

Finding 1: Quality weighting beats plurality vote

On AIME 2025, CWBT's confidence-weighted aggregation improved accuracy by 15.6 percentage points over standard self-consistency on Qwen3-4B. The gain comes from two sources: correct traces get more influence in the vote, and sampling terminates before adding noise from traces that would have weakened the aggregate. The accuracy improvement holds even before factoring in the token reduction.

Finding 2: Token reduction is real but context-dependent

The headline figure is more than 10x token reduction at matched or improved accuracy across multiple model families. The 27x figure represents a specific comparison against a strong baseline running at high N. The 10x range is the more consistent planning number: it appears across models and benchmarks, and reflects the cumulative effect of early termination for easy queries and mid-trace pruning for low-quality paths. What the number means in practice depends on how much N you were running before and how hard your query distribution is.

Finding 3: Accuracy does not degrade under large compute reductions

The paper's cleaner claim is that DDC does not trade accuracy for tokens. The technique concentrates compute on the traces most likely to reach the right answer, so cutting total trace count via pruning and early termination does not cost accuracy. In several benchmark configurations accuracy improves, which the authors attribute to CWBT's quality-weighted voting being a better aggregation mechanism than simple plurality, not just a cheaper one.

Scope and limitations

The evaluation covers five benchmarks using open-weight LLMs. The largest token reduction figures come from strong-baseline comparisons at high N. Tasks that benefit from exploring diverse reasoning strategies may gain less from path pruning than tasks with a single correct logical path. The authors do not provide a formal guarantee that DDC outperforms standard self-consistency on all task types, and the exact reduction you see will depend on your model, your N, and your query distribution.

04 Practical takeaways

What this means
for production reasoning.

DDC requires no fine-tuning and applies to any pipeline already running self-consistency sampling. For anyone spending real compute on multi-trace reasoning, both components are worth evaluating independently before combining them.

For teams running self-consistency at scale

CWBT's confidence-weighted voting is a drop-in replacement for plurality vote in an existing self-consistency pipeline. It does not require a new model, new training data, or fine-tuning. Start there and measure the accuracy delta before adding TASP path pruning. The two components compound when combined, but each can be evaluated on its own.

For evaluators comparing test-time scaling techniques

DDC sits in a different part of the design space from LaTER (which optimizes the depth of a single trace via latent-space exploration) and BoundaryRouter (which routes between a base LLM and a full agent). The three are complementary: DDC works inside the self-consistency loop, LaTER changes how each individual trace reasons, and BoundaryRouter decides whether to invoke any of this at all. All three can be combined in a single system.

On trusting the 10x number

Measure token reduction on your own task distribution before treating any published figure as a planning assumption. The reduction varies with baseline N, query difficulty, and how confidently traces tend to converge on your specific problem type. A 3x reduction on an easier workload is still worth capturing. Run the comparison with your actual query mix.

For high-stakes reasoning applications

The paper's result that quality-weighted voting outperforms plurality vote has implications beyond cost. Even if token budget is not a concern, using confidence weighting means the final answer is more likely to reflect the traces that demonstrated careful reasoning rather than the traces that were simply more numerous. The accuracy benefit and the efficiency benefit come from the same source: identifying which traces are actually worth listening to.

05 Further exploration

Where to go
from here.

Steps to evaluate DDC for your workload, and related work worth reading alongside it.

Read the paper

Xu, R., Li, Y., Zhao, T., Wu, Y., Li, B., & Yan, H. (2026). Dual-Dimensional Consistency: Balancing Budget and Quality in Adaptive Inference-Time Scaling. Xi'an Jiaotong University. arXiv:2605.15100.

Implement CWBT first

Confidence-weighted aggregation is the simpler component to add. Replace your majority vote with a Bayesian aggregation weighted by per-trace confidence scores, then add early termination based on convergence and confidence thresholds. Measure accuracy and token count before and after. Get this working before adding path pruning.

Benchmark across query difficulty tiers

Evaluate separately on easy, medium, and hard queries from your production distribution. DDC's largest gains appear on hard queries where traces diverge in quality. Easy queries that converge quickly still benefit from early termination, but the accuracy improvement is concentrated at the hard end. Knowing this breakdown tells you where the technique is earning its keep.
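
A small helper for that breakdown might look like the following. The record fields (`tier`, `baseline_tokens`, `ddc_tokens`, `baseline_correct`, `ddc_correct`) are a hypothetical logging schema, and the numbers are invented to show the shape of the report, not results from the paper.

```python
from collections import defaultdict

def reduction_by_tier(records):
    """Group per-query records by difficulty tier and compare baseline vs. DDC."""
    groups = defaultdict(list)
    for r in records:
        groups[r["tier"]].append(r)
    report = {}
    for tier, rows in groups.items():
        base_tokens = sum(r["baseline_tokens"] for r in rows)
        ddc_tokens = sum(r["ddc_tokens"] for r in rows)
        base_correct = sum(r["baseline_correct"] for r in rows)
        ddc_correct = sum(r["ddc_correct"] for r in rows)
        report[tier] = {
            "token_reduction": base_tokens / ddc_tokens,
            "accuracy_delta": (ddc_correct - base_correct) / len(rows),
        }
    return report

# Invented records for illustration only.
records = [
    {"tier": "easy", "baseline_tokens": 8000, "ddc_tokens": 1000,
     "baseline_correct": True, "ddc_correct": True},
    {"tier": "hard", "baseline_tokens": 12000, "ddc_tokens": 3000,
     "baseline_correct": False, "ddc_correct": True},
    {"tier": "hard", "baseline_tokens": 12000, "ddc_tokens": 5000,
     "baseline_correct": True, "ddc_correct": True},
]
for tier, stats in reduction_by_tier(records).items():
    print(tier, stats)
```

Reporting the two metrics side by side per tier keeps a large token saving on easy queries from masking a small or negative accuracy change on hard ones.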

Read LaTER for the complementary technique

LaTER (arXiv:2605.07315) applies to individual traces: it explores reasoning in latent space before switching to explicit chain-of-thought, cutting trace length without cutting accuracy. DDC applies to the multi-trace sampling loop. Together they address two separate points in the inference pipeline and can be deployed in combination.

Review BoundaryRouter for the upstream routing decision

BoundaryRouter (arXiv:2605.07180) decides, per query, whether to run any self-consistency sampling at all or to fall back to a direct model call. Its RouteBench evaluation uses a three-split structure (in-domain, paraphrased, OOD) that is useful for understanding how any inference optimization generalizes as query distributions shift over time.
