Elon Musk's AI company, xAI, has released its latest model, Grok-4, which has demonstrated impressive performance across a range of academic and problem-solving benchmarks, positioning it as a formidable competitor to established models from OpenAI, Google, and Anthropic. According to independent benchmarks from LMArena.ai, an open platform for crowdsourced AI evaluation, Grok-4 scores in the top three across all categories, taking the #1 spot in Math and #2 in Coding.
These results highlight Grok-4's significant advances, particularly in mathematical reasoning. The Grok-4 Heavy version achieved a perfect score of 100% on the AIME 2025 benchmark and also led the pack on the USAMO 2025 benchmark with 61.9%. These performances place Grok-4 ahead of competitors such as OpenAI's o3, Google's Gemini 2.5 Pro, and Anthropic's Claude Opus 4 on challenging math tests.
Grok-4 is also highly competitive in the coding domain, outperforming rival models on LiveCodeBench, a coding benchmark. Even so, Gemini 2.5 Pro and Anthropic's Claude models are still widely considered the strongest for coding, though that may change with the expected August release of Grok-4 Code, a version optimized for coding tasks. In tests evaluating writing and editing code, Grok-4 Heavy placed fourth, solving 79.6% of tasks correctly.
The model has also set new records on benchmarks measuring abstract reasoning. On ARC-AGI-2, Grok-4 scored 15.9-16.2%, nearly double the previous commercial state of the art. The benchmark is particularly significant because it tests abstract reasoning and pattern recognition, widely viewed as indicators of progress toward artificial general intelligence. Grok-4 features a 256K-token context window and natively incorporates live data from X and other online sources.
xAI offers Grok-4 via an API at competitive pricing, along with a premium "Heavy" tier at $300 per seat per month that runs five Grok-4 agents in parallel to tackle the most demanding tasks. While Grok-4's benchmark performance is impressive, some experts caution that these metrics don't fully capture real-world performance and that models such as GPT-4o and Claude Opus 4 remain highly reliable for enterprise workflows.
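For developers who want to evaluate the model themselves, the sketch below shows one way an API call could look. It assumes xAI exposes an OpenAI-compatible chat-completions endpoint at https://api.x.ai/v1 and that the model identifier is "grok-4"; those details, along with the XAI_API_KEY environment-variable name, are assumptions to verify against xAI's official documentation rather than confirmed specifics from this article.

```python
# Minimal sketch of calling Grok-4 through xAI's API.
# Assumptions (verify against xAI's docs): an OpenAI-compatible
# endpoint at https://api.x.ai/v1/chat/completions and the model
# identifier "grok-4".
import os
import requests

API_URL = "https://api.x.ai/v1/chat/completions"  # assumed endpoint
API_KEY = os.environ["XAI_API_KEY"]                # your xAI API key

payload = {
    "model": "grok-4",  # assumed model identifier
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Prove that the sum of two even integers is even."},
    ],
    "temperature": 0.2,
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()

# Print the model's reply from the first completion choice.
print(response.json()["choices"][0]["message"]["content"])
```

Note that the multi-agent "Heavy" tier described above is a subscription product; the sketch covers only a standard single-model API request.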