
Archived Article — The Daily Perspective is no longer active. This article was published on 1 March 2026 and is preserved as part of the archive.

Technology

AI Still Can't Add Up: New Tests Reveal Persistent Math Failures in Top Models

Fresh benchmarking shows incremental gains across leading AI systems, but even the best performer would earn no better than a middling grade.

Key Points
  • Researchers at Omni Calculator ran a second round of ORCA benchmark tests on four leading AI models using 500 practical maths problems.
  • Gemini 3 Flash led the pack with 72.8 per cent accuracy, while ChatGPT 5.2 and DeepSeek V3.2 both hovered around 54-55 per cent.
  • Grok 4.1 was the only model to regress, slipping 2.6 percentage points to 60.2 per cent accuracy.
  • A key finding is that AI models are growing better at making answers look correct through formatting, while still struggling with the underlying arithmetic.
  • Researchers say the core problem is architectural: LLMs are prediction engines, not logic engines, and may never fully close the gap without integration with dedicated calculation tools.

The promise of artificial intelligence as a tool for everyday decision-making rests, in part, on a deceptively simple question: can these systems do basic maths? According to a new round of benchmark testing released exclusively to The Register, the answer in early 2026 remains a qualified no, though with some meaningful improvement since the same tests were conducted late last year.

Researchers affiliated with Omni Calculator, a Poland-based maker of specialised online calculators, subjected four of the most widely used large language models to the company's ORCA (Omni Research on Calculation in AI) Benchmark, a suite of 500 practical, real-world maths problems spanning domains from finance and physics to health and biology. The results suggest that progress is real but modest, and that a structural problem at the heart of how these systems work has not gone away.

The top performer in this second round of testing was Google's Gemini 3 Flash, which achieved 72.8 per cent accuracy, a gain of 9.8 percentage points over its predecessor, Gemini 2.5 Flash. That improvement is genuine, yet a score of 72.8 per cent would earn a student a C on most Australian university grading scales. DeepSeek V3.2 (stable release) reached 55.2 per cent, up 3.2 percentage points from its earlier alpha version. ChatGPT 5.2 recorded 54.0 per cent accuracy, an improvement of 4.6 percentage points. The sole regressor was xAI's Grok 4.1, which slipped to 60.2 per cent, a fall of 2.6 percentage points from its predecessor, Grok 4.

Chart: ORCA Benchmark results comparing Gemini 3 Flash, Grok 4.1, DeepSeek V3.2 and ChatGPT 5.2 across 500 practical maths problems. Source: Omni Calculator / The Register.

What often goes unmentioned in breathless coverage of AI capabilities is the question of consistency. The ORCA researchers measured what they call "instability": how often a model changed its answer when asked the same question a second time. Gemini 3 Flash was the most consistent, altering its incorrect responses only 46.1 per cent of the time. ChatGPT 5.2 revised its incorrect answers 65.2 per cent of the time, while DeepSeek V3.2 changed its response for 68.8 per cent of errors. That last figure is striking: ask DeepSeek the same question twice and, when it gets the answer wrong, there is a better than two-in-three chance it will give a different wrong answer the second time.
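Omni Calculator has not published its test harness, but the instability metric as described can be sketched in a few lines of Python. Everything below is illustrative: `ask_model`, `always_right` and `make_flaky` are hypothetical stand-ins for real model API calls.

```python
def measure_instability(ask_model, problems):
    """ORCA-style instability: of the questions a model answers
    incorrectly on the first ask, the fraction whose answer changes
    when the identical question is asked a second time."""
    wrong_first = 0
    changed = 0
    for question, truth in problems:
        first = ask_model(question)
        if first == truth:
            continue  # consistency is only tracked for errors
        wrong_first += 1
        if ask_model(question) != first:
            changed += 1
    return changed / wrong_first if wrong_first else 0.0


# Hypothetical stand-ins for real model calls, for illustration only.
def always_right(question):
    return str(eval(question))  # a perfectly reliable "model"

def make_flaky():
    counter = iter(range(10**6))
    # Never repeats an answer, so every error changes on a re-ask.
    return lambda question: f"wrong-{next(counter)}"
```

On these stubs, `measure_instability(always_right, [("2+2", "4")])` yields 0.0, while the flaky stub scores 1.0: every wrong answer changes on the second ask, the behaviour the ORCA team observed in roughly two of every three DeepSeek errors.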

Dawid Siuda, a researcher on the ORCA project at Omni Calculator, framed the problem in terms that anyone who has used these tools will recognise. "A calculator is predictable," he said in a statement provided alongside the test results. "Ask it the same question today or next year, and the answer stays the same. AI doesn't work that way." In correspondence with The Register, Siuda elaborated: "AI models are essentially prediction engines rather than logic engines. Because they work on probability, they are basically guessing the next most likely number or word based on patterns they have seen before. It is like a student who memorizes every answer in a math book but never actually learns how to add."

The domain-level results add texture to that picture. Gemini 3 Flash reached 93.2 per cent accuracy on straightforward Maths and Conversions problems, up from 83 per cent. DeepSeek made striking gains in Biology and Chemistry, rising from 10.5 per cent to 43.9 per cent. Grok 4.1, by contrast, lost 9 percentage points in Health and Sports and 5.3 percentage points in Biology and Chemistry. The researchers speculate that recent updates to Grok may have prioritised capabilities other than quantitative reasoning, a reminder that the development priorities of commercial AI labs do not always align with the metrics that matter most to end users.

A particularly pointed finding concerns the nature of the errors themselves. Calculation mistakes now account for 39.8 per cent of all errors, up from 33.4 per cent in the earlier round, while rounding errors fell from 34.7 per cent to 25.8 per cent. The ORCA team interprets this as evidence that models are becoming more sophisticated at presenting answers in a clean, formally correct style while the underlying arithmetic remains unreliable. In short, AI is getting better at looking right without actually being right.

It would be unfair, however, to dismiss these findings as proof that AI development is failing. The counterargument has genuine force. The improvement from the November 2025 baseline to February 2026 is tangible across three of the four models tested, and the pace of model releases has accelerated dramatically over this period. Those who work closely with these systems point out that the ORCA benchmark tests raw model outputs rather than the hybrid architectures that major commercial deployments actually use. Both Google and OpenAI already employ a technique called "function calling", in which the AI hands arithmetic off to a deterministic, purpose-built calculation engine rather than attempting to compute the answer itself. Siuda acknowledges this path: "Major AI companies like Google and OpenAI are already doing this by having the AI call a function to do the actual calculation," though he notes the difficulty escalates with long, multi-step problems where the model must track intermediate results across many stages.
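Neither lab's production pipeline is public, but the pattern Siuda describes — the model emitting a structured tool request that the host then executes deterministically — can be sketched as follows. The `answer` helper and the reply format are illustrative assumptions, not any vendor's actual API; the point is that the arithmetic itself never passes through the probabilistic model.

```python
import ast
import operator

# Deterministic arithmetic evaluator: the "tool" the model hands off to.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg}

def calculate(expression):
    """Safely evaluate a plain arithmetic expression (no names, no calls)."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval"))

def answer(model_reply):
    """Host-side loop: if the model requests the calculator tool, run it
    deterministically instead of trusting the model's own arithmetic."""
    if model_reply.get("tool") == "calculate":
        return str(calculate(model_reply["arguments"]["expression"]))
    return model_reply["text"]
```

A reply such as `{"tool": "calculate", "arguments": {"expression": "1234 * 5678"}}` then returns "7006652" from the deterministic evaluator, identically on every call. The hard part, as Siuda notes, is multi-step problems, where the model must correctly compose many such calls and carry intermediate results between them.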

A separate avenue of research, reported last November in Nature, points to formal mathematical proofs as another potential solution. Google DeepMind developed an approach using reinforcement learning grounded in proofs written with the Lean proof assistant, which achieved a silver-medal result at the International Mathematical Olympiad. That is an impressive research outcome, though it addresses a very different class of problem from the mortgage calculations and medication dosages that ordinary users are more likely to need.
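DeepMind's system itself is not public, but the flavour of a Lean-checked statement is easy to show. In a minimal illustration like the one below (not DeepMind's code), the proof assistant's kernel verifies the claim mechanically rather than a model predicting it:

```lean
-- A concrete arithmetic fact, checked by computation:
example : 2 + 2 = 4 := rfl

-- A general law over all natural numbers, checked by the kernel:
example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```

A statement that does not hold simply fails to compile, which is precisely the predictability Siuda contrasts with probabilistic generation.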

The policy stakes are concrete. Governments and regulators in Australia and globally are still working out how to govern AI deployment in high-stakes contexts, from medical advice to financial planning. The ORCA results, drawn as they are from the kinds of calculations that affect real-world decisions, offer a useful data point for that conversation. An error rate of 27.2 per cent, even from the best-performing model, is not an acceptable baseline for clinical or legal applications, and it is precisely the sort of evidence that should inform the Australian government's ongoing AI regulatory framework discussions.

The evidence, though incomplete, suggests that the industry is moving in the right direction, even if the pace is uneven and the destination remains distant. Siuda himself concludes that closing the arithmetic gap entirely is probably impossible with current language model architecture alone, but that combining these systems with well-integrated function-calling tools may offer a practical path forward. That hybrid model, rather than the pure LLM, is likely where mature AI deployment will land. The honest answer to whether AI can do your maths for you in 2026 is: sometimes, and with a meaningful probability of getting it wrong, particularly when the problem is long or unusual. For now, the calculator on your phone remains the more trustworthy colleague.

Priya Narayanan

Priya Narayanan is an AI editorial persona created by The Daily Perspective, analysing the Indo-Pacific, geopolitics and multilateral institutions with scholarly precision. Articles under this byline are generated using artificial intelligence with editorial quality controls.