Research

Curious learnings from the AI frontier. Papers we read, summaries we wrote, things that surprised us. Not for profit, just for understanding.

Dual-Dimensional Consistency: Smarter Self-Consistency Sampling
Xu, Li, Zhao, Wu, Li & Yan · Xi'an Jiaotong University · 2026

Synthesized June 2, 2026 · DDC combines confidence-weighted Bayesian voting and trend-aware path pruning to make self-consistency sampling adaptive. Over 10x token reduction at matched or improved accuracy across five reasoning benchmarks
Inference Optimization
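
To make "confidence-weighted voting" concrete, here is a minimal sketch of the general idea, not the DDC paper's Bayesian method. It assumes each sampled reasoning path yields an answer and a confidence score in [0, 1]. The function names and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(samples):
    """Plain self-consistency: the most frequent answer wins."""
    counts = Counter(answer for answer, _ in samples)
    return counts.most_common(1)[0][0]


def weighted_vote(samples):
    """Confidence-weighted variant: each path votes with its confidence."""
    scores = defaultdict(float)
    for answer, confidence in samples:
        scores[answer] += confidence
    return max(scores, key=scores.get)


# Hypothetical (answer, confidence) pairs from three sampled paths.
samples = [("42", 0.9), ("41", 0.4), ("41", 0.3)]

print(majority_vote(samples))  # two low-confidence paths outvote one
print(weighted_vote(samples))  # one high-confidence path outweighs them
```

In this toy case the two rules disagree: counting heads picks "41", while weighting by confidence picks "42". How DDC actually derives and combines confidences, and how it prunes paths, is described in the original paper.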
Agentic Systems as Boosting: When Weak Models Beat the Frontier
Sunkaraneni, Beneventano, Neumarker, Poggio & Galanti · MIT & Texas A&M · 2026

Synthesized May 30, 2026 · A committee of nano-scale models hit 76.4% on SWE-bench Verified, matching Gemini 3 Pro and Claude Opus 4.5 Thinking run standalone. The framework explains when and why committees of weak models match frontier accuracy: the key condition is that the task has a local verifier
Agent Architecture
Is Grep All You Need? The Agent Harness Moves Accuracy More Than the Retrieval Method
Sen, Kasturi, Lumer, Gulati, Subbiah et al. · 2026

Synthesized May 28, 2026 · Testing grep vs vector retrieval across four agent harnesses (Chronos, Claude Code, Codex, Gemini CLI) reveals the harness layer shifts accuracy more than the retrieval method; Claude Code favors grep, Gemini CLI favors vector
Agentic Search
BoundaryRouter: Learning When to Escalate to an Agent
Wang, Qiu et al. · Princeton, Michigan, Tsinghua et al. · 2026

Synthesized May 23, 2026 · A training-free cold-start router builds experience memory from a small seed set to route queries between plain LLM inference and full agent execution; 60.6% inference time reduction vs always-agent, 28.6% accuracy gain vs always-LLM
Agent Routing
LaTER: Latent-Phase Reasoning Cuts Tokens 32% Without Losing Accuracy
Li, Wang, Liu et al. · 2026

Synthesized May 18, 2026 · A training-free two-phase method explores in latent space before switching to explicit chain-of-thought. On Qwen3-14B: 32% token reduction, AIME 2025 accuracy improves from 70.0% to 73.3%
Inference Optimization
ComplexMCP: Three Failure Modes in Large-Scale Tool Sandboxes
Li, Yang, Wang et al. · 2026

Synthesized May 17, 2026 · A 150+ tool MCP benchmark exposes three reproducible failure modes in frontier LLM agents: tool-retrieval saturation, over-confidence skipping verification, and strategic defeatism
Agent Evaluation
STALE: When Agent Memory Becomes a Liability
Chao, Bai et al. · 2026

Synthesized May 15, 2026 · A 1,200-query benchmark probes whether frontier LLMs can detect when stored memories have been silently invalidated. Best model scores 55.2%. Implicit conflict is the dominant failure mode
Agent Memory
AI Co-Mathematician: When Scaffolding Beats the Model
Zheng, von Glehn, Zwols et al. · Google DeepMind · 2026

Synthesized May 14, 2026 · The same base model scores 19% alone and 48% inside a multi-agent workbench with parallel workstreams, stored failure records, and enforced review cycles
Agentic AI
Meta-Harness: The 6x Gap Lives in Your Code, Not Your Model
Lee, Nair, Zhang, Lee, Khattab & Finn · Stanford University & MIT · 2026

Synthesized May 13, 2026 · Fixing the model and varying only the surrounding harness code produces a 6x performance gap. Meta-Harness automatically searches for better harness code using full diagnostic history, beating SOTA on text classification, IMO-level math reasoning, and agentic coding
AI Systems
How Coding Agents Actually Perform in the Wild
Popescu, Gros, Botocan, Pandita, Devanbu & Izadi · TU Delft & UC Davis · 2026

Synthesized May 12, 2026 · 110,000 open-source pull requests from five coding agents reveal that agent code gets merged but churns faster than human-authored code over time
Software Engineering
Conversation Reduces Load. Images Build It.
Taneja, Singh & Goel · Georgia Institute of Technology · 2026

Synthesized May 11, 2026 · A 124-person randomized controlled trial found multimodal conversational AI produces better biology learning outcomes than text-only chat or semantic search, explained through two distinct cognitive load mechanisms
AI in Education
Image Generation Diversity: When Models Miss the Map
Dombrowski, Zhang, Cechnicka, Reynaud & Kainz · FAU Erlangen-Nürnberg & Imperial College London · 2025

Synthesized May 10, 2026 · No state-of-the-art image generator covers more than 77% of its training distribution, and standard metrics like FID can't detect the gap. A new metric (IRS) and a model variant (DiADM) address both problems
Generative AI
SLOW: The AI Tutor That Thinks Before It Speaks
Wei, Li & Jiang · Shanghai Institute of AI for Education · 2026

Synthesized May 8, 2026 · A four-stage reasoning workspace that separates cognitive diagnosis, stability validation, affect prediction, and strategy selection before any tutoring response is generated
AI Tutoring
Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
Tran & Kiela · Stanford University · 2026

Synthesized May 7, 2026 · When you control for thinking-token compute, the multi-agent advantage on multi-hop reasoning largely disappears across five architectures and three model families
Agent Architecture
LLMs in Games: When Generated Content Runs the Rules
Johnson, Ahmed, Lang, Thethi, Zheng & de Souza Santos · University of Calgary · 2026

Synthesized April 19, 2026 · Student developers embedded LLMs as architectural components in two games. The finding: model errors turned into fairness violations, not cosmetic bugs
Game Development
Arknights: When the AI Lies, Players Learn
Shuai Guo · Uppsala University · 2025

Synthesized April 9, 2026 · A mobile strategy game deliberately gives players unreliable AI guidance, reshaping agency from action to interpretation
Explainable AI
Vibe Coding: Flow, Trust, and Co-Creation
Pimenova, Fakhoury, Bird, Storey & Endres · U Michigan / Microsoft Research · 2025

Synthesized April 4, 2026 · The first qualitative study of vibe coding reveals a new programming paradigm built on flow and calibrated AI trust
Vibe Coding
Games That Teach AI Ethics
Solyst, Nakigozi, Fong & Shapiro · University of Washington · 2025

Synthesized April 3, 2026 · Two multiplayer games use text-to-image AI to teach teens about bias through play
AI Education
BAVT: Spend Less, Reason Better
Li et al. · UBC / Vector Institute · 2026

Synthesized March 25, 2026 · Budget-Aware Value Trees cut AI agent costs by 75% with equal or better accuracy
AI Agents

These summaries are layperson interpretations of published research. They are not peer-reviewed and may simplify or omit nuance. Always refer to the original papers for complete findings.

Synthesized by Kelly Chiang & Claude.