Elon Musk's AI company, xAI, has released its latest model, Grok-4, which has demonstrated impressive performance across a range of academic and problem-solving benchmarks, positioning it as a formidable competitor to established models from OpenAI, Google, and Anthropic. According to independent benchmarks from LMArena.ai, an open platform for crowdsourced AI evaluation, Grok-4 scores in the top three across all categories, taking the #1 spot in Math and #2 in Coding.
These results highlight Grok-4's significant advances, particularly in mathematical reasoning. The Grok-4 Heavy version achieved a perfect score of 100% on the AIME 2025 benchmark and also led the pack on the USAMO 2025 benchmark with 61.9%. These performances place Grok-4 ahead of competitors such as OpenAI's o3, Google's Gemini 2.5 Pro, and Anthropic's Claude Opus 4 on challenging math tests.
Grok-4 is also highly competitive in the coding domain, outperforming rival models on LiveCodeBench, a coding benchmark. Even so, Gemini 2.5 Pro and Anthropic's Claude models are still widely considered the strongest for coding, though that may change with the expected August release of Grok-4 Code, a version optimized for coding tasks. In tests evaluating writing and editing code, Grok-4 Heavy placed fourth, solving 79.6% of tasks correctly.
The model has also set new records on benchmarks measuring abstract reasoning. On ARC-AGI-2, Grok-4 scored 15.9-16.2%, nearly double the previous commercial state of the art. The benchmark is particularly significant because it tests abstract reasoning and pattern recognition, widely viewed as indicators of progress toward artificial general intelligence. Grok-4 features a 256K-token context window and natively incorporates live data from X and other online sources.
xAI offers Grok-4 via an API at competitive pricing, along with a premium "Heavy" tier at $300 per seat per month that runs five Grok-4 agents in parallel to tackle the most demanding tasks. While Grok-4's benchmark performance is impressive, some experts caution that these metrics don't fully capture real-world performance and that models such as GPT-4o and Claude Opus 4 remain highly reliable for enterprise workflows.
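For developers who want to evaluate the model themselves, the sketch below shows one way an API call could look. It assumes xAI exposes an OpenAI-compatible chat-completions endpoint at https://api.x.ai/v1 and that the model identifier is "grok-4"; those details, along with the XAI_API_KEY environment-variable name, are assumptions to verify against xAI's official documentation rather than confirmed specifics from this article.

```python
# Minimal sketch of calling Grok-4 through xAI's API.
# Assumptions (verify against xAI's docs): an OpenAI-compatible
# endpoint at https://api.x.ai/v1/chat/completions and the model
# identifier "grok-4".
import os
import requests

API_URL = "https://api.x.ai/v1/chat/completions"  # assumed endpoint
API_KEY = os.environ["XAI_API_KEY"]                # your xAI API key

payload = {
    "model": "grok-4",  # assumed model identifier
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Prove that the sum of two even integers is even."},
    ],
    "temperature": 0.2,
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()

# Print the model's reply from the first completion choice.
print(response.json()["choices"][0]["message"]["content"])
```

Note that the multi-agent "Heavy" tier described above is a subscription product; the sketch covers only a standard single-model API request.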