Not every reasoning path
deserves equal weight.
Self-consistency sampling generates multiple reasoning chains and takes a majority vote. Every trace counts the same, whether it reasoned carefully or hallucinated confidently, and every trace runs to completion no matter how poorly it is going. Researchers at Xi'an Jiaotong University built DDC to fix both problems: weight votes by path quality, and prune weak traces before they finish.
First surfaced in Tandemly Briefing — 2026-05-19.
Every path gets one vote.
Even the bad ones.
Self-consistency is reliable and widely used. Its weakness is that it treats every reasoning chain as equally trustworthy, regardless of how well that chain actually reasoned.
Self-consistency is one of the more reliable techniques for improving LLM accuracy on hard reasoning tasks. The idea is simple: generate N independent reasoning chains for a single query, then take the majority vote on the final answers. If most chains arrive at the same answer, that answer is probably right.
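As a point of reference, plain self-consistency voting can be sketched in a few lines. The function name `majority_vote` and the sample answers are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer among the sampled chains."""
    if not answers:
        raise ValueError("need at least one answer")
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

# Final answers extracted from N = 5 hypothetical reasoning chains.
sampled = ["42", "41", "42", "42", "17"]
winner = majority_vote(sampled)  # every chain contributes exactly one vote
```

Nothing in this vote looks at how each chain reached its answer, which is the gap the rest of the piece is about.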
The problem is what the vote ignores. Some reasoning chains proceed carefully through a difficult problem and arrive at the right answer via sound logic. Others follow a plausible-sounding path and arrive at a wrong answer with the same apparent confidence. Standard self-consistency counts both the same. A hallucinated answer backed by several traces can beat a correct answer backed by fewer, regardless of how well each set of traces actually reasoned.
The second problem is depth: how long each chain runs. Once a reasoning chain has started, current systems typically run it to completion before deciding whether it was worth running at all. The decision to extend or abandon a path is made after all the compute has already been spent. There is no mechanism to detect, mid-trace, that a chain has gone off the rails.
Can you build a unified framework that handles both problems at once: adjusting how much weight each path gets based on its quality, and terminating low-quality paths early before they burn unnecessary compute?
Weight the vote.
Prune the path.
DDC addresses two dimensions simultaneously: how answers are aggregated across paths (width), and how long each path runs (depth). The two components can work independently but are designed to reinforce each other.
The first component is Confidence-Weighted Bayesian Voting, or CWBT. Standard self-consistency runs a vote where each trace has an equal say. CWBT replaces that with a Bayesian aggregation that weights each answer by a confidence score derived from the trace that produced it. Traces that reasoned soundly get more influence over the final answer. Traces that wandered or contradicted themselves carry less weight.
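To make the idea concrete, here is a minimal sketch of confidence-weighted voting. It uses a plain normalized sum of confidences, which is an assumption standing in for the paper's Bayesian aggregation; the function name `weighted_vote` and the confidence values are hypothetical.

```python
from collections import defaultdict

def weighted_vote(traces):
    """traces: list of (answer, confidence) pairs, confidence in (0, 1].

    Returns (best_answer, scores), where scores sum to 1 across answers.
    Assumption: a simple normalized sum, not the paper's exact formula.
    """
    if not traces:
        raise ValueError("need at least one trace")
    totals = defaultdict(float)
    for answer, confidence in traces:
        totals[answer] += confidence
    norm = sum(totals.values())
    scores = {answer: weight / norm for answer, weight in totals.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Three low-confidence traces agree on "17"; two high-confidence traces say "42".
traces = [("17", 0.30), ("17", 0.25), ("17", 0.20), ("42", 0.90), ("42", 0.85)]
best, scores = weighted_vote(traces)
```

A plain majority vote over the same five traces would pick "17" by three votes to two. With the weights above, "42" wins because its two traces carry more total confidence.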
The system also uses this confidence information to decide when to stop sampling. Rather than always generating a fixed N traces, CWBT terminates early when two conditions are met: the answers across traces have converged sufficiently, and the confidence in the converging answer is high enough. Easy queries that all traces answer the same way stop early. Hard queries where traces diverge continue toward N.
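The two stopping conditions can be sketched as a loop that checks agreement and confidence after each new trace. The thresholds, the `min_n` floor, and the name `sample_until_confident` are all assumptions for illustration; the paper's actual criteria may differ.

```python
from collections import defaultdict

def sample_until_confident(trace_source, max_n=16, min_n=3,
                           agree_threshold=0.8, conf_threshold=0.7):
    """Consume (answer, confidence) pairs until the vote converges or max_n is hit.

    trace_source: iterable yielding one (answer, confidence) pair per trace,
        with confidence > 0.
    Returns (best_answer, traces_used).
    """
    totals = defaultdict(float)  # summed confidence per answer
    counts = defaultdict(int)    # number of traces per answer
    used = 0
    best = None
    for answer, confidence in trace_source:
        totals[answer] += confidence
        counts[answer] += 1
        used += 1
        best = max(totals, key=totals.get)
        share = totals[best] / sum(totals.values())  # condition 1: convergence
        mean_conf = totals[best] / counts[best]      # condition 2: confidence
        converged = share >= agree_threshold and mean_conf >= conf_threshold
        if (used >= min_n and converged) or used >= max_n:
            break
    return best, used

# Hypothetical easy query: every trace agrees, with high confidence.
easy = [("42", 0.9)] * 16
# Hypothetical hard query: traces keep splitting between two answers.
hard = [("a", 0.6), ("b", 0.6)] * 8

easy_answer, easy_used = sample_until_confident(easy)
hard_answer, hard_used = sample_until_confident(hard)
```

In this sketch the easy query stops at the `min_n` floor, while the hard query never meets the agreement threshold and runs to `max_n`.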
The second component is Trend-Aware Stratified Pruning, or TASP. This operates during generation of each individual trace. TASP treats the emerging reasoning text as a signal over time, similar to how a moving average separates trend from noise in a data series. It identifies the underlying direction of a trace and compares it with the direction of the high-quality peer traces generated alongside it. When a trace's trajectory diverges clearly from its well-performing peers, TASP prunes it before it completes, recovering the compute for the traces still worth running.
A reasoning trace does not proceed in a straight line. Some start promisingly and veer off midway. Others recover from a shaky beginning. TASP tracks trajectory, not just current state. A trace heading toward a wrong answer can be caught early even if the most recent tokens still look reasonable in isolation. This is the key difference from cutoff-based pruning, which just stops traces at a fixed length.
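The trajectory idea can be illustrated with a small sketch. It assumes each trace exposes a per-step confidence value, smooths it with a trailing moving average, and prunes a trace only when it is both well below the top half of its peers and not improving. The signal, the top-half reference group, and the `window` and `margin` parameters are assumptions for illustration, not the paper's definitions.

```python
def moving_average(values, window):
    """Trailing moving average; early positions average over what is available."""
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed

def traces_to_prune(signals, window=3, margin=0.15):
    """signals: dict of trace_id -> non-empty list of per-step confidence values.

    Returns the ids of traces that sit well below the top-half peers
    AND whose smoothed signal is flat or falling.
    """
    level, trend = {}, {}
    for trace_id, values in signals.items():
        smoothed = moving_average(values, window)
        level[trace_id] = smoothed[-1]
        # Trend: change in the smoothed signal over the last `window` steps.
        trend[trace_id] = smoothed[-1] - smoothed[max(0, len(smoothed) - 1 - window)]
    ranked = sorted(level.values(), reverse=True)
    top = ranked[: max(1, len(ranked) // 2)]
    reference = sum(top) / len(top)
    return {trace_id for trace_id in signals
            if level[trace_id] < reference - margin and trend[trace_id] <= 0}

# Hypothetical per-step confidence for four partially generated traces.
signals = {
    "a": [0.80, 0.82, 0.85, 0.86, 0.88],  # steady and strong
    "b": [0.70, 0.75, 0.80, 0.82, 0.85],  # steady and strong
    "c": [0.80, 0.70, 0.60, 0.50, 0.40],  # started well, veering off
    "d": [0.30, 0.40, 0.50, 0.60, 0.70],  # shaky start, recovering
}
pruned = traces_to_prune(signals)
```

Here only trace "c" is pruned. Trace "d" is also below its peers at this checkpoint, but its upward trend keeps it alive, which is the distinction between tracking trajectory and tracking current state.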
On five benchmarks,
fewer tokens, matched accuracy.
The researchers evaluated DDC across five reasoning benchmarks using open-weight LLMs. Token reduction varied by configuration, but accuracy held or improved across the conditions tested.
On AIME 2025, CWBT's confidence-weighted aggregation improved accuracy by 15.6 percentage points over standard self-consistency on Qwen3-4B. The gain comes from two sources: correct traces get more influence in the vote, and sampling terminates before adding noise from traces that would have weakened the aggregate. The accuracy improvement holds even before factoring in the token reduction.
The headline figure is more than 10x token reduction at matched or improved accuracy across multiple model families. A higher figure, 27x, comes from one specific comparison against a strong baseline running at high N. The 10x range is the more consistent planning number: it appears across models and benchmarks, and reflects the cumulative effect of early termination for easy queries and mid-trace pruning for low-quality paths. What the number means in practice depends on how much N you were running before and how hard your query distribution is.
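A back-of-envelope model shows why the reduction depends on your baseline N and query mix. Every input below is a made-up planning value, not a result from the paper, and the function `estimated_reduction` is a hypothetical helper that assumes all full traces cost the same number of tokens.

```python
def estimated_reduction(n, easy_fraction, easy_traces, hard_kept_fraction):
    """Ratio of baseline tokens to tokens with early stopping plus pruning.

    Costs are in units of one full trace, so per-trace length cancels out.
    n: traces per query under plain self-consistency.
    easy_fraction: share of queries that stop sampling early.
    easy_traces: traces actually generated for an easy query.
    hard_kept_fraction: share of the n-trace budget still spent on a hard
        query after mid-trace pruning.
    """
    baseline = n
    reduced = (easy_fraction * easy_traces
               + (1 - easy_fraction) * n * hard_kept_fraction)
    return baseline / reduced

# Same hypothetical query mix, two different baselines.
high_n = estimated_reduction(n=64, easy_fraction=0.7, easy_traces=4,
                             hard_kept_fraction=0.4)
low_n = estimated_reduction(n=8, easy_fraction=0.7, easy_traces=4,
                            hard_kept_fraction=0.4)
```

With identical assumptions about query difficulty, the N = 64 baseline yields a much larger ratio than the N = 8 baseline, because early stopping removes far more work when the starting budget is large.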
The paper's cleaner claim is that DDC does not trade accuracy for tokens. The technique concentrates compute on the traces most likely to reach the right answer, so cutting total trace count via pruning and early termination does not cost accuracy. In several benchmark configurations accuracy improves, which the authors attribute to CWBT's quality-weighted voting being a better aggregation mechanism than simple plurality, not just a cheaper one.
The evaluation is limited to five benchmarks and open-weight LLMs. The largest token reduction figures come from strong-baseline comparisons at high N. Tasks that benefit from exploring diverse reasoning strategies may gain less from path pruning than tasks with a single correct logical path. The authors do not provide a formal guarantee that DDC outperforms standard self-consistency on all task types, and the exact reduction you see will depend on your model, your N, and your query distribution.
What this means
for production reasoning.
DDC requires no fine-tuning and applies to any pipeline already running self-consistency sampling. For anyone spending real compute on multi-trace reasoning, both components are worth evaluating independently before combining them.
Where to go
from here.
Steps to evaluate DDC for your workload, and related work worth reading alongside it.