Agentic AI · Mathematical Reasoning

The model scored 19%.
The system scored 48%.

First surfaced in Tandemly Briefing, 2026-05-07.

Google DeepMind wrapped Gemini 3.1 Pro in a hierarchical multi-agent workbench: parallel workstreams, stored failure records, enforced review cycles. The base model scored 19% on FrontierMath Tier 4. The full system scored 48% on the same benchmark. The model didn't change. The scaffolding did.

Core concept

Stateful workbench: an agent architecture that stores failed proofs as research artifacts, runs parallel competing workstreams, and enforces review gates before any result is accepted. The scaffolding is the research process, not a wrapper around it.


01 The problem

Hard math needs
a different kind of process.

Expert-level mathematical research isn't a retrieval task. It's a long, failure-prone process that spans weeks. Most AI math tools weren't built for that.

Most AI tools treat mathematical reasoning as a prompt-in, answer-out exchange. The model sees a problem, produces output, and the result is either right or wrong. When it's wrong, the failed attempt is discarded. The next call starts from scratch.

Real mathematical research doesn't work that way. A researcher might spend days on a proof attempt, find a flaw, and then build the next approach on what that specific failure revealed. The failed proof isn't wasted effort. It's evidence about the structure of the problem.

FrontierMath classifies its Tier 4 problems as ones that take expert mathematicians weeks to solve. The implication is that solving them requires more than a capable model. It requires a capable process: one that holds multiple competing approaches at once, reviews its own outputs for errors, and retains what it learned from the approaches that didn't work.

The question this paper asks

How much of the gap between what AI achieves on hard mathematical problems and what it could achieve is a model capability problem, and how much is an architecture problem?

02 The experiment

A hierarchy of agents
built for long-horizon work.

The system is organized in three layers. The key design choices are about memory and review, not just capability.

The AI Co-Mathematician is a workbench, not a single model. It wraps Gemini 3.1 Pro in a hierarchical multi-agent structure with specific roles at each level. The design mirrors how a research team works rather than how a lone researcher does.

When given a research task, the system breaks it into parallel workstreams that explore the problem from multiple directions simultaneously. One builds a library of relevant mathematical results. Another reviews existing literature for prior work. A third searches for counterexamples that would rule out promising approaches.

Every candidate proof passes through a mandatory review by a dedicated verifier agent, running Gemini Deep Think in proof-checking mode. A proof that fails review is not deleted. It's stored as a record of what was tried and why it didn't work. That stored failure is available to all subsequent agents working the same problem.
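The review-and-store cycle described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `FailureRecord`, `Workspace`, `research_loop`, and the toy `propose` and `review` functions are all hypothetical names standing in for the model call and the verifier agent.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FailureRecord:
    """A rejected proof kept as a research artifact (hypothetical schema)."""
    proof: str
    flaw: str


@dataclass
class Workspace:
    """Shared state visible to every agent working the same problem."""
    failures: list[FailureRecord] = field(default_factory=list)


def research_loop(
    propose: Callable[[list[FailureRecord]], str],
    review: Callable[[str], Optional[str]],
    workspace: Workspace,
    max_attempts: int = 5,
) -> Optional[str]:
    """Propose, review, and store failures until a proof passes or attempts run out.

    `review` returns None when it accepts the proof, otherwise a description of the flaw.
    """
    for _ in range(max_attempts):
        candidate = propose(workspace.failures)  # later attempts see earlier failures
        flaw = review(candidate)                 # every candidate goes through review
        if flaw is None:
            return candidate
        workspace.failures.append(FailureRecord(candidate, flaw))
    return None


# Toy stand-ins for the model and the verifier (assumptions for illustration).
def toy_propose(failures: list[FailureRecord]) -> str:
    return "proof via induction" if failures else "proof via symmetry"


def toy_review(proof: str) -> Optional[str]:
    return "symmetry fails for n = 2" if "symmetry" in proof else None


ws = Workspace()
result = research_loop(toy_propose, toy_review, ws)
```

In this toy run the first attempt is rejected and stored, and the second attempt, which can read that record, is accepted. The rejected proof stays in `ws.failures` rather than being thrown away.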

Project Coordinator

Manages the overall research task. Assigns workstreams, synthesizes progress across parallel investigations, and decides what gets escalated to the human researcher.

Workstream Coordinators

Three parallel investigation channels running simultaneously: literature review, mathematical library development, and counterexample search. Each can reach different conclusions from the same starting problem.

Specialized Agents

A search agent, a coding agent, and Gemini Deep Think acting as a mandatory proof reviewer. The reviewer is a required architectural step, not an optional check.
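One way to picture the three layers is a coordinator that fans a problem out to named workstreams and collects what each reports. The sketch below assumes a thread pool and placeholder `investigate` callables. The `Workstream` class and `coordinate` function are hypothetical, and the paper does not describe its orchestration at this level of detail.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class Workstream:
    """One investigation channel run by a workstream coordinator."""
    name: str
    investigate: Callable[[str], str]  # placeholder for the agents it directs


def coordinate(problem: str, workstreams: list[Workstream]) -> dict[str, str]:
    """Project-coordinator role: fan the problem out, then collect findings by name."""
    with ThreadPoolExecutor(max_workers=len(workstreams)) as pool:
        futures = {ws.name: pool.submit(ws.investigate, problem) for ws in workstreams}
        return {name: future.result() for name, future in futures.items()}


# Toy workstreams; real ones would call search, coding, and reviewer agents.
workstreams = [
    Workstream("literature review", lambda p: f"prior work on: {p}"),
    Workstream("library development", lambda p: f"lemmas for: {p}"),
    Workstream("counterexample search", lambda p: f"no counterexample yet for: {p}"),
]

findings = coordinate("toy conjecture", workstreams)
```

Keeping findings keyed by workstream name lets the coordinator compare channels that reached different conclusions from the same starting problem.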

The Kourovka Notebook case

The Kourovka Notebook is a collection of unsolved group theory problems maintained since 1965. Problem 21.10 had been open for over sixty years. An Oxford mathematician worked through it using the AI Co-Mathematician.

The system launched two parallel workstreams: one attempting a proof, one hunting for a counterexample. The first workstream produced a proof. The reviewer agent examined it and flagged a specific flaw. The mathematician read the rejected proof alongside the identified flaw. That context let the researcher see exactly how to close the gap. The AI didn't solve the problem. It produced a failure specific enough to point toward the solution.

03 Findings

The scaffolding was worth
29 percentage points.

Same underlying model. Two very different results. The entire gap between 19% and 48% came from how the agents were organized around the same weights.

Base model alone

19%

Gemini 3.1 Pro · FrontierMath Tier 4

Full system

48%

Same model · Better scaffolding

Problems first solved

3

Not cracked by any prior AI system

Standard AI math setup

Prompt in, answer out. Stateless exchange. The model attempts the problem, returns output, and the attempt is either correct or discarded. No parallel exploration. No dedicated reviewer. No stored failure history.

AI Co-Mathematician

Stateful workbench. Parallel workstreams explore proof and counterexample simultaneously. A mandatory reviewer checks every candidate output. Failed proofs are stored with their identified flaws. The workspace persists across the full research session.

The benchmark result in plain terms

FrontierMath Tier 4 is a set of 48 non-public problems classified as requiring weeks of expert work. Gemini 3.1 Pro alone solved 9 of them. The full AI Co-Mathematician system solved 23. The 14-problem difference comes entirely from the architectural choices: parallel workstreams, stored failures, and mandatory review cycles.

Three of the 23 problems solved by the full system had not been solved by any previously evaluated AI. The base model hadn't reached those three either.
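The headline percentages follow directly from the problem counts given above. A quick check, using only those counts (the variable names are ours):

```python
TOTAL_PROBLEMS = 48
BASE_SOLVED = 9
SYSTEM_SOLVED = 23

base_rate = BASE_SOLVED / TOTAL_PROBLEMS      # 0.1875, reported as 19%
system_rate = SYSTEM_SOLVED / TOTAL_PROBLEMS  # about 0.479, reported as 48%

extra_problems = SYSTEM_SOLVED - BASE_SOLVED                    # 14
point_gap = round(system_rate * 100) - round(base_rate * 100)   # 48 - 19 = 29
exact_gap = (system_rate - base_rate) * 100                     # about 29.2 before rounding
relative_gain = SYSTEM_SOLVED / BASE_SOLVED                     # about 2.56x

print(f"base {base_rate:.2%}, system {system_rate:.2%}")
print(f"+{extra_problems} problems, {point_gap} points, {relative_gain:.2f}x as many solved")
```

The 29-point figure is the difference of the rounded percentages. Before rounding, the gap is about 29.2 points, and in relative terms the full system solved roughly two and a half times as many problems as the base model.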

What the Kourovka result actually shows

The case study is worth understanding precisely. The AI did not produce a correct proof. It produced a specific, well-reasoned incorrect proof. The reviewer agent identified the flaw. That combination, a rejected proof alongside an identified weakness, gave the human mathematician exactly the information needed to close the argument independently. The AI's failure was informative in a way that a blank rejection wouldn't have been.

Scope and limitations

FrontierMath Tier 4 problems have definite answers. Open-ended mathematical research often doesn't. The Kourovka result is an illustrative case, not a controlled study. The authors present it as a demonstration of the system's potential. This is early-stage work and the qualitative collaboration evidence is intentionally preliminary.

04 Practical takeaways

What this means
for building with AI.

A 29-point gain from scaffolding alone, with the model held fixed, is substantial, and plausibly larger than what many model upgrades deliver. That's worth factoring into decisions about where to invest architectural effort.

For builders evaluating model upgrades

Before reaching for a better model, audit whether the existing scaffolding captures what the expert process actually requires. Parallel exploration, enforced review, and stored failure history account for the performance gap in this paper, since the underlying model was held fixed. That's a useful prior for any hard reasoning application.

For architects designing expert-domain AI

On the long-horizon problems in this paper, a stateful workspace outperformed stateless exchanges with the same model. If your system discards intermediate reasoning, doesn't track what approaches have been tried, or has no review step, you're leaving architectural performance on the table. The specific choices here are worth studying: stored failures, mandatory review, parallel workstreams.

For practitioners in other expert domains

The framing generalizes. Legal analysis, scientific hypothesis generation, financial modeling, and complex system design all involve long iterative processes where failed intermediate attempts carry informational value. Stateless prompt-in-answer-out architectures discard that value. Stateful workspaces that record the negative space of exploration don't.

A note on what the paper doesn't claim

The system helped with specific hard problems. It didn't operate independently. It needed a human to interpret the flawed proof and understand the gap. Its first proof attempt was wrong. The value came from the specificity of the failure, not from the system being right. The paper is explicit about this, and that clarity matters.

05 Further exploration

Where to go
from here.

If you want to go deeper.

Read the paper

Zheng, D., von Glehn, I., Zwols, Y. et al. (2026). AI Co-Mathematician: Accelerating Mathematicians with Agentic AI. Google DeepMind. arXiv:2605.06651.

Understand the benchmark

FrontierMath, from Epoch AI, classifies its Tier 4 problems as requiring weeks of expert work. Reading a few benchmark problem descriptions helps calibrate what a 48% solve rate actually means and how far the gap between Tier 4 and "standard hard math" really is.

Audit your own system for failure storage

Most agent systems silently discard failed reasoning. Ask whether a running log of tried-and-rejected approaches would change how later agents in your system approach the same problem. The AI Co-Mathematician treats this as a core architectural decision, not a debugging aid.
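A running log of rejected approaches can be as simple as an append-only file that later agents read before they start. The sketch below assumes a JSON Lines file and two hypothetical helpers, `record_failure` and `failures_as_context`. The field names are illustrative and are not taken from the paper.

```python
import json
import tempfile
from pathlib import Path


def record_failure(log: Path, approach: str, reason: str) -> None:
    """Append one tried-and-rejected approach as a JSON line."""
    with log.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"approach": approach, "reason": reason}) + "\n")


def failures_as_context(log: Path) -> str:
    """Render the log as text a later agent can read before its own attempt."""
    if not log.exists():
        return "No rejected approaches recorded yet."
    lines = log.read_text(encoding="utf-8").splitlines()
    entries = [json.loads(line) for line in lines if line.strip()]
    bullets = [f"- {e['approach']}: rejected because {e['reason']}" for e in entries]
    return "Previously rejected approaches:\n" + "\n".join(bullets)


# Demo in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "failures.jsonl"
    record_failure(log_path, "direct induction on n", "base case n = 1 does not hold")
    record_failure(log_path, "reduce to abelian case", "quotient map is not injective")
    context = failures_as_context(log_path)
```

The rendered `context` string can be prepended to the next agent's prompt. Storing the reason alongside the approach is the important part: a bare list of dead ends says what to avoid, but not what the dead end revealed.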

Apply the parallel-workstream pattern

Instead of a single linear agent process, run a solution-attempt workstream and a refutation-search workstream simultaneously. This pattern generalizes to any domain where you're not sure in advance whether a proposed answer or its counterargument is correct: legal review, system failure analysis, product validation.
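As a rough sketch of this pattern, the two workstreams can run concurrently with `asyncio.gather` and report side by side. The agent functions here are hypothetical stand-ins with hardcoded toy results. A real system would replace their bodies with model or tool calls.

```python
import asyncio
from typing import Optional


async def attempt_solution(claim: str) -> Optional[str]:
    """Hypothetical stand-in for an agent that tries to establish the claim."""
    await asyncio.sleep(0.02)  # simulated work
    return None                # toy outcome: no supporting argument found


async def search_refutation(claim: str) -> Optional[str]:
    """Hypothetical stand-in for an agent that hunts for a counterexample."""
    await asyncio.sleep(0.01)  # simulated work
    # Toy outcome: n = 40 gives 40*40 + 40 + 41 = 1681 = 41 * 41, which is not prime.
    return "n = 40" if "prime" in claim else None


async def investigate(claim: str) -> dict[str, Optional[str]]:
    """Run both workstreams concurrently and report what each one found."""
    solution, refutation = await asyncio.gather(
        attempt_solution(claim), search_refutation(claim)
    )
    return {"solution": solution, "refutation": refutation}


report = asyncio.run(investigate("n*n + n + 41 is prime for every n >= 0"))
```

Returning both results, rather than stopping at whichever finishes first, keeps the losing workstream's output available as evidence. That matches the article's emphasis on retaining what each line of attack produced.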

Add a mandatory review step

The reviewer agent in this system is not optional. Making review a required architectural gate rather than an advisory check is one of the specific design choices that contributed to the benchmark result. Consider where in your own agent pipeline a mandatory review step would catch the kind of specific, well-reasoned errors that are hardest to detect.
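One way to make review a gate rather than advice is to have downstream steps accept only objects that the review function produces. The sketch below is an assumption about how such a gate could be structured, not a description of the paper's code. `Reviewed`, `review`, `publish`, and `toy_reviewer` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Reviewed:
    """A candidate paired with the reviewer's verdict."""
    candidate: str
    flaw: Optional[str]  # None means the reviewer found nothing wrong


def review(candidate: str, reviewer: Callable[[str], Optional[str]]) -> Reviewed:
    """Run the reviewer and wrap the result; the intended way to build a Reviewed."""
    return Reviewed(candidate=candidate, flaw=reviewer(candidate))


def publish(result: Reviewed) -> str:
    """Downstream step that refuses anything that has not cleared review."""
    if not isinstance(result, Reviewed):
        raise TypeError("unreviewed output cannot be published")
    if result.flaw is not None:
        raise ValueError(f"rejected in review: {result.flaw}")
    return result.candidate


# Toy reviewer (an assumption): flags hand-waving language.
def toy_reviewer(text: str) -> Optional[str]:
    return "step is asserted, not proved" if "obviously" in text else None


accepted = publish(review("lemma 2 follows by direct computation", toy_reviewer))
```

Passing a raw string to `publish` raises `TypeError`, and a candidate the reviewer flagged raises `ValueError` carrying the flaw. Python does not stop code from constructing `Reviewed` directly, so this is a convention backed by a runtime check rather than a hard guarantee.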
