Evaluation
How you measure whether the system actually works: equal-compute baselines, instrumentation that catches silent over-budgeting, and benchmarks that survive scrutiny.
-
When: You’re evaluating a multi-agent stack against a single-agent baseline, or assessing a vendor’s claim that their multi-agent design improves quality.
Use: A single-agent baseline at an equal thinking-token budget before adopting multi-agent. Hold total reasoning compute constant across both architectures; if the single agent matches the multi-agent system at equal compute, the multi-agent overhead is not buying you anything on this task. A harness sketch follows this entry.
Evidence: Across three model families and five multi-agent architectures (sequential, debate, ensemble, parallel-roles, subtask-parallel), the multi-agent advantage on multi-hop reasoning largely vanished once total thinking-token budgets were normalized. Single-agent reasoning matched or beat every multi-agent variant at every meaningful budget level above 100 tokens.
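A minimal harness sketch for the equal-compute comparison above. Everything here is illustrative, not a real library API: `run_single_agent` and `run_multi_agent` are hypothetical runners that return realized thinking-token usage, and each task is assumed to expose `prompt`, `id`, and a `grade` scorer.

```python
# Hypothetical equal-compute harness; no real SDK is assumed.
from dataclasses import dataclass

@dataclass
class RunResult:
    answer: str
    thinking_tokens: int  # realized usage, as measured, never the requested budget

def equal_compute_comparison(tasks, run_single_agent, run_multi_agent,
                             budget: int) -> dict:
    """Score both architectures under one total thinking-token budget per task.

    The multi-agent runner must split `budget` across all of its internal
    calls; the single-agent runner spends it in one call. Both report
    realized usage so an over-budget run fails loudly instead of being
    silently credited.
    """
    scores = {"single": 0.0, "multi": 0.0}
    for task in tasks:
        for name, runner in (("single", run_single_agent),
                             ("multi", run_multi_agent)):
            result: RunResult = runner(task.prompt, budget)
            if result.thinking_tokens > budget:
                raise ValueError(
                    f"{name} spent {result.thinking_tokens} thinking tokens "
                    f"against a {budget}-token budget on task {task.id}"
                )
            scores[name] += task.grade(result.answer)
    return scores
```

Failing on over-budget runs is deliberate: a system that quietly exceeds the shared budget invalidates the equal-compute premise, which is exactly the failure mode the next practice instruments for.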
-
When: You’re benchmarking reasoning systems and relying on a vendor’s reported reasoning-budget parameter to control compute.
Use: Direct instrumentation of actual thinking-token usage instead of trusted API budget caps. Log the realized token counts for every system under test, and treat documented budget parameters as soft hints unless you have verified they behave as hard caps on the specific model and version you’re using. A logging sketch follows this entry.
Evidence: The authors of the equal-budget study found that Gemini’s thinkingBudget parameter does not behave like a hard cap: actual visible-thought output often fell well below the requested budget, and API-reported token counts did not always match the visible reasoning text. Multi-agent systems that make multiple calls under the same nominal budget can be silently over-credited, distorting any comparison built on the parameter alone.
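A sketch of that instrumentation, assuming a hypothetical `client.generate(prompt, thinking_budget=...)` call whose response carries a realized thinking-token count in `response.usage.thinking_tokens`. Those names are placeholders, not a real SDK surface; on an actual provider (Gemini included), verify which usage-metadata field reports thinking tokens for your model and version before trusting it.

```python
# Provider-agnostic audit wrapper; all field names below are placeholders.
import logging

log = logging.getLogger("budget_audit")

class BudgetAudit:
    """Routes model calls through one point, logging requested vs. realized
    thinking tokens and accumulating realized usage across a whole run."""

    def __init__(self, client):
        self.client = client
        self.total_realized = 0  # summed across every call in the run

    def call(self, prompt: str, requested_budget: int):
        response = self.client.generate(prompt, thinking_budget=requested_budget)
        realized = response.usage.thinking_tokens  # measured, not requested
        self.total_realized += realized
        if realized > requested_budget:
            # The documented cap behaved as a soft hint on this call.
            log.warning("requested %d thinking tokens, realized %d",
                        requested_budget, realized)
        return response

# Usage: send every call in a multi-agent run through one BudgetAudit, then
# compare audit.total_realized against the single-agent baseline's realized
# usage, not against the nominal per-call budget parameters.
```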