The model scored 19%.
The system scored 48%.
First surfaced in Tandemly Briefing, 2026-05-07.
Google DeepMind wrapped Gemini 3.1 Pro in a hierarchical multi-agent workbench: parallel workstreams, stored failure records, enforced review cycles. The base model scored 19% on FrontierMath Tier 4. The full system scored 48% on the same benchmark. The model didn't change. The scaffolding did.
Hard math needs
a different kind of process.
Expert-level mathematical research isn't a retrieval task. It's a long, failure-prone process that spans weeks. Most AI math tools weren't built for that.
Most AI tools treat mathematical reasoning as a prompt-in, answer-out exchange. The model sees a problem, produces output, and the result is either right or wrong. When it's wrong, the failed attempt is discarded. The next call starts from scratch.
Real mathematical research doesn't work that way. A researcher might spend days on a proof attempt, find a flaw, and then build the next approach on what that specific failure revealed. The failed proof isn't wasted effort. It's evidence about the structure of the problem.
FrontierMath Tier 4 classifies its problems as ones that take expert mathematicians weeks to solve. The implication is that solving them requires more than a capable model. It requires a capable process: one that holds multiple competing approaches at once, reviews its own outputs for errors, and retains what it learned from the approaches that didn't work.
How much of the gap between what AI currently achieves on hard mathematical problems and what it could achieve comes from model capability, and how much comes from architecture?
A hierarchy of agents
built for long-horizon work.
The system is organized in three layers. The key design choices are about memory and review, not just capability.
The AI Co-Mathematician is a workbench, not a single model. It wraps Gemini 3.1 Pro in a hierarchical multi-agent structure with specific roles at each level. The design mirrors how a research team works rather than how a lone researcher does.
When given a research task, the system breaks it into parallel workstreams that explore the problem from multiple directions simultaneously. One builds a library of relevant mathematical results. Another reviews existing literature for prior work. A third searches for counterexamples that would rule out promising approaches.
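The fan-out described above can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual implementation: the three worker functions, the `WORKSTREAMS` table, and `run_workstreams` are all hypothetical names, and each worker is a stub standing in for what would be a model-backed agent.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for model-backed agents. Each takes the problem
# statement and returns a short text report.
def build_result_library(problem: str) -> str:
    return f"known results relevant to: {problem}"


def review_literature(problem: str) -> str:
    return f"prior work on: {problem}"


def search_counterexamples(problem: str) -> str:
    return f"counterexample search for: {problem}"


# One entry per workstream named in the text (names are assumptions).
WORKSTREAMS = {
    "library": build_result_library,
    "literature": review_literature,
    "counterexamples": search_counterexamples,
}


def run_workstreams(problem: str) -> dict[str, str]:
    """Run every workstream concurrently and collect the reports by name."""
    with ThreadPoolExecutor(max_workers=len(WORKSTREAMS)) as pool:
        futures = {
            name: pool.submit(worker, problem)
            for name, worker in WORKSTREAMS.items()
        }
        # .result() blocks until that workstream has finished.
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    for name, report in run_workstreams("an open group theory problem").items():
        print(f"{name}: {report}")
```

The point of the sketch is the shape: the problem is handed to several independent lines of attack at once, and their outputs are gathered in one place for whatever comes next.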
Every candidate proof passes through a mandatory review by a dedicated verifier agent, running Gemini Deep Think in proof-checking mode. A proof that fails review is not deleted. It's stored as a record of what was tried and why it didn't work. That stored failure is available to all subsequent agents working the same problem.
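A minimal sketch of that review loop, under stated assumptions: `FailureRecord`, `FailureStore`, and `solve_with_review` are hypothetical names, and the prover and verifier are toy functions rather than real model calls. What it shows is the pattern from the text: every candidate goes through the verifier, and a rejected candidate is stored with its flaw and handed to the next attempt instead of being thrown away.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FailureRecord:
    attempt: str  # the rejected candidate proof
    flaw: str     # what the verifier found wrong with it


@dataclass
class FailureStore:
    records: list[FailureRecord] = field(default_factory=list)

    def add(self, attempt: str, flaw: str) -> None:
        self.records.append(FailureRecord(attempt, flaw))


# A prover sees the problem plus all earlier failures.
Prover = Callable[[str, list[FailureRecord]], str]
# A verifier returns a description of the flaw, or None if the proof passes.
Verifier = Callable[[str], Optional[str]]


def solve_with_review(
    problem: str,
    prover: Prover,
    verifier: Verifier,
    store: FailureStore,
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first candidate that passes review, or None if none does."""
    for _ in range(max_rounds):
        candidate = prover(problem, list(store.records))
        flaw = verifier(candidate)
        if flaw is None:
            return candidate
        store.add(candidate, flaw)  # keep the failure, do not discard it
    return None


# Toy agents for illustration only.
def toy_prover(problem: str, failures: list[FailureRecord]) -> str:
    return f"attempt {len(failures) + 1}"


def toy_verifier(candidate: str) -> Optional[str]:
    return None if candidate == "attempt 3" else f"gap found in {candidate}"


if __name__ == "__main__":
    store = FailureStore()
    print(solve_with_review("toy problem", toy_prover, toy_verifier, store))
    for record in store.records:
        print(record.attempt, "->", record.flaw)
```

Even when the loop returns `None`, the store still holds every rejected attempt together with its identified flaw, which is the kind of record a human collaborator can read.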
The Kourovka Notebook is a collection of unsolved group theory problems maintained since 1965. Problem 21.10 had been open for over sixty years. An Oxford mathematician worked through it using the AI Co-Mathematician.
The system launched two parallel workstreams: one attempting a proof, one hunting for a counterexample. The first workstream produced a proof. The reviewer agent examined it and flagged a specific flaw. The mathematician read the rejected proof alongside the identified flaw. That context let the researcher see exactly how to close the gap. The AI didn't solve the problem. It produced a failure specific enough to point toward the solution.
The scaffolding was worth
29 percentage points.
Same underlying model. Two very different results. The entire gap between 19% and 48% came from how the agents were organized around the same weights.
FrontierMath Tier 4 is a set of 48 non-public problems classified as requiring weeks of expert work. Gemini 3.1 Pro alone solved 9 of them. The full AI Co-Mathematician system solved 23. The 14-problem difference comes entirely from the architectural choices: parallel workstreams, stored failures, and mandatory review cycles.
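The headline percentages follow directly from those counts. The short check below uses only the figures quoted above (48 problems, 9 and 23 solved); the variable names are just for illustration.

```python
TOTAL_PROBLEMS = 48
BASE_SOLVED = 9      # Gemini 3.1 Pro alone
SYSTEM_SOLVED = 23   # full AI Co-Mathematician system

base_pct = 100 * BASE_SOLVED / TOTAL_PROBLEMS      # 18.75
system_pct = 100 * SYSTEM_SOLVED / TOTAL_PROBLEMS  # 47.92 (to two decimals)

# Rounded to whole percentages, as the headline figures are.
base_rounded = round(base_pct)      # 19
system_rounded = round(system_pct)  # 48

gap_points = system_rounded - base_rounded  # 29 percentage points
gap_problems = SYSTEM_SOLVED - BASE_SOLVED  # 14 problems

if __name__ == "__main__":
    print(f"base:   {base_pct:.2f}% -> {base_rounded}%")
    print(f"system: {system_pct:.2f}% -> {system_rounded}%")
    print(f"gap:    {gap_points} points, {gap_problems} problems")
```

Put another way, the full system solved roughly two and a half times as many problems as the base model (23 versus 9).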
Three of the 23 problems solved by the full system had not been solved by any previously evaluated AI. The base model hadn't reached those three either.
The Kourovka case study is worth stating precisely. The AI did not produce a correct proof. It produced a specific, well-reasoned incorrect proof. The reviewer agent identified the flaw. That combination, a rejected proof alongside an identified weakness, gave the human mathematician exactly the information needed to close the argument independently. The AI's failure was informative in a way that a blank rejection wouldn't have been.
FrontierMath Tier 4 problems have definite answers. Open-ended mathematical research often doesn't. The Kourovka result is an illustrative case, not a controlled study. The authors present it as a demonstration of the system's potential. This is early-stage work, and the qualitative evidence about human-AI collaboration is intentionally preliminary.
What this means
for building with AI.
A 29-percentage-point gap on the same model is larger than most model upgrades produce. That's worth factoring into decisions about how much effort to invest in architecture rather than in waiting for a better model.
Where to go
from here.
If you want to go deeper.