
Gemini, ChatGPT or Grok: Which AI chatbot is best at calculation accuracy?

Artificial intelligence has become an integral part of our lives, including everyday calculations, leading many to wonder how well these systems actually handle basic math.

December 30, 2025

Researchers conducted a study analyzing the accuracy of four AI models on 500 everyday math prompts. The results showed that, on average, there is roughly a 40 percent chance that an AI will get the answer wrong.

The Omni Research on Calculation in AI (ORCA) study shows that when you ask an AI chatbot to perform everyday math, accuracy varies significantly across AI companies and across distinct types of mathematical tasks.

The models tested are:

  • Gemini 2.5 Flash (Google)
  • ChatGPT-5 (OpenAI)
  • DeepSeek V3.2 (DeepSeek AI)
  • Grok-4 (xAI)

The results demonstrated that no AI model scored above 63 percent on everyday math. Even the leader, Gemini (63 percent), still gets nearly four out of ten problems wrong.

Grok achieved almost the same score at 62.8 percent, while DeepSeek ranks third at 52 percent, followed by ChatGPT at 49.4 percent.

AI accuracy peaks in math and conversions, hits record lows in physical tasks

Performance varies across distinct categories. In Math and Conversions (147 of the 500 prompts), Gemini leads with 83 percent, followed by Grok at 76.9 percent and DeepSeek at 74.1 percent.

According to Euronews, ChatGPT scored 66.7 percent in this category, while the average accuracy across all four models is 72.1 percent, the highest among the seven categories.

In any case, users are advised to use calculators or double-check with another prompt to avoid errors.

Four types of mistakes made by AI models

Experts categorized mistakes into four types, noting that a primary challenge lies in translating a real-world situation into the correct formulas.

Computation errors

In these cases, the AI understands the question and the formula but fails during the actual computation. This category includes precision and rounding issues (35 percent) and calculation errors (33 percent).

Faulty logic errors

This type of error is among the most serious because it shows the AI struggling to grasp the core of the problem. These include method or formula errors (14 percent), such as using an incomplete mathematical approach, and wrong assumptions, which account for up to 12 percent of mistakes.

Misinterpreting instructions

Misreading of instructions primarily occurs when the AI fails to correctly interpret what the question is asking. Examples include using the wrong parameters, making logical errors, and providing incomplete answers.

Question deflection

It has been observed that the AI sometimes simply refuses or deflects a question rather than attempting a specific answer. A separate weak spot is rounding, especially in multi-step calculations: if an error occurs at any point, the final result is commonly far off.
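The compounding effect of rounding in multi-step calculations can be sketched in a few lines of Python. The example below is purely illustrative (the exchange rates are made up and are not from the study): rounding an intermediate result to two decimal places, as a chatbot often does when it narrates each step, shifts the final answer away from the full-precision result.

```python
# Illustrative sketch: how rounding an intermediate step in a
# multi-step calculation (here, a two-hop currency conversion)
# drifts from the full-precision answer. Rates are hypothetical.
usd = 137.00
usd_to_eur = 0.9273    # EUR per USD (made-up rate)
eur_to_jpy = 163.47    # JPY per EUR (made-up rate)

# Full precision: round only once, at the very end.
exact = round(usd * usd_to_eur * eur_to_jpy, 2)

# Step-by-step rounding: round the intermediate EUR amount to cents
# (127.0401 EUR becomes 127.04 EUR) before converting onward.
eur_rounded = round(usd * usd_to_eur, 2)
drifted = round(eur_rounded * eur_to_jpy, 2)

print(exact, drifted)  # the two final amounts no longer agree
```

A tiny loss in the intermediate step (a fraction of a euro cent) is multiplied by the second rate, so the gap grows with every additional step; this is the same mechanism the researchers flag for chained word problems.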

Notably, the research used the most advanced models available to the general public for free.

The study concludes with the following insights: if you need an accurate answer to a tricky word problem, ChatGPT wins; if you want to snap a picture of a receipt or get an instant response, Gemini wins; and if you want speed and a concise answer, Grok is a strong choice.

The results demonstrate that significant improvements are still needed before AI chatbots can reliably handle everyday math and conversions.