AI Model Benchmarks
Comprehensive performance comparison across major AI labs
Last updated: April 24, 2026 · Data sourced from official model cards and benchmark publications
| Model (Lab, Release) | GPQA Diamond | MMLU | MMMLU | HumanEval | SWE-bench Verified | AIME 2024 | MATH-500 | HLE | ARC-AGI-2 | LiveCodeBench | MMMU | SimpleQA | IFEval | ARC-AGI-1 | Vending-Bench | Alpha Arena S1 | Alpha Arena S1.5 | Prophet Arena | EQ-Bench 3 | BrowseComp | OSWorld | Terminal-Bench | τ2-bench Retail | MCP Atlas | MMMU Pro | APEX-Agents | τ2-bench Telecom | PinchBench | SWE-Pro | VIBE-Pro | MLE-Bench Lite | GDPval-AA | SWE-Multilingual | GDPval | Expert-SWE | Toolathlon | FrontierMath Tiers 1–3 | FrontierMath Tier 4 | CyberGym | FinanceAgent | OfficeQA Pro | USAMO 2026 | GraphWalks BFS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude 3.7 Sonnet (Anthropic, 2025-02) | 68 | 88 | - | 93.7 | 62.3 | - | 80 | - | - | - | - | - | 86.5 | 32 | - | - | - | - | 1083.7 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Claude Mythos Preview (Anthropic, 2026-04) | 94.5 | - | 92.7 | - | 93.9 | - | - | 56.8 | - | - | - | - | - | - | - | - | - | - | - | - | 79.6 | 82 | - | - | - | - | - | - | 77.8 | - | - | - | 87.3 | - | - | - | - | - | - | - | - | +97.6% | +80% |
| Claude Opus 4.1 (Anthropic, 2025-04) | 79.2 | - | 89.5 | 92 | 72.8 | - | 94.8 | - | - | - | - | - | 91.4 | 60 | - | - | - | - | 1407.9 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Claude Opus 4.5 (Anthropic, 2025-10) | 87 | - | 90.8 | - | 80.9 | 100 | 97.5 | 32 | 37.6 | - | 78.5 | - | 93.1 | 76 | $4,967 | - | - | 0.8036 | 1683.1 | 67.8 | 66.3 | 59.8 | 88.9 | 62.3 | 70.6 | - | 98.2 | 88.9 | - | - | - | 1416 | - | - | - | - | - | - | - | - | - | - | - |
| Claude Opus 4.6 (Anthropic, 2026-02) | 91.3 | - | - | - | 80.8 | - | 97.6 | 40 | 75.2 | - | - | - | - | 85 | $8,018 | - | - | - | 1961 | 84 | 72.7 | 65.4 | 91.9 | 59.5 | 73.9 | 33.5 | 99.3 | 90.6 | - | - | - | 1606 | - | - | - | - | - | - | - | - | - | +42.3% | +38.7% |
| Claude Opus 4.7 (Anthropic, 2026-04) | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 79.3 | 78 | 69.4 | - | - | - | - | - | - | - | - | - | - | - | +80.3% | - | - | 43.8 | 22.9 | 73.1 | - | - | - | - |
| Claude Sonnet 4.5 (Anthropic, 2025-10) | 83.4 | - | 89.1 | - | 82 | 87 | 96.2 | 19.8 | - | 64 | 75.4 | - | 92.8 | 64 | $3,839 | -30.8% | -35.1% | 0.7966 | 1501.4 | 43.9 | 61.4 | 51 | 86.2 | 43.8 | 63.4 | - | - | 92.7 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Claude Sonnet 4.6 (Anthropic, 2026-02) | 74.1 | 79.1 | - | - | 79.6 | - | 97.8 | 19.1 | 58.3 | - | - | - | - | - | $5,700 | - | - | - | - | 74.7 | - | 59.1 | - | - | - | - | - | - | - | - | - | 1553 | - | - | - | - | - | - | - | - | - | - | - |
| DeepSeek V3 (DeepSeek, 2024-12) | 59.1 | 88.5 | - | 82.6 | 42 | - | 90.2 | - | - | - | - | 24.9 | 87.5 | - | - | +4.9% | -29.2% | - | 1048.7 | - | - | - | - | - | - | - | - | 82.1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| DeepSeek-R1 (DeepSeek, 2025-01) | 71.5 | 90.8 | - | - | 49.2 | 79.8 | 97.3 | - | - | 65.9 | - | 30.1 | 83.3 | 45 | - | - | - | 0.7662 | 1185.6 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Gemini 2.5 Flash (Google, 2025-04) | 70.5 | - | 85.1 | - | 49.6 | 83 | 91.2 | - | - | - | 68 | - | 87.5 | 42 | - | - | - | 0.8090 | 1074.2 | - | - | - | - | - | - | - | - | 76.6 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Gemini 2.5 Pro (Google, 2025-03) | 84 | - | 89.2 | - | 63.8 | 92 | 95.2 | 21.6 | - | 70.4 | 75.8 | - | 89.5 | 63 | - | -56.7% | - | 0.8121 | 1347.8 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Gemini 3 Flash (Google, 2025-12) | 82.1 | - | 87.5 | - | 62 | 90 | 93.5 | - | - | - | 74.2 | - | 89 | 55 | $3,635 | - | - | - | - | - | - | - | - | - | - | - | - | 95.1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Gemini 3 Pro (Google, 2025-11) | 91.9 | - | 91.8 | - | 76.2 | 100 | 97 | 45.8 | 31.1 | 79.5 | 81 | - | 91.5 | 80 | $5,478 | - | -25.7% | 0.8104 | 1629.5 | 59.2 | - | 56.2 | 85.3 | 54.1 | 81 | 18.4 | - | 91.7 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Gemini 3.1 Pro (Google, 2026-02) | 94.3 | - | 93.1 | - | 80.6 | - | - | 44.4 | 77.1 | - | - | - | - | - | - | - | - | - | - | 85.9 | - | 68.5 | - | - | 80.5 | 33.5 | - | - | 54.2 | - | - | - | - | +67.3% | - | 48.8 | 36.9 | 16.7 | - | - | - | +74.4% | - |
| Gemini Deep Think (Google, 2026-02) | 93.8 | - | - | - | - | - | - | 48.4 | 84.6 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| GPT-4.1 (OpenAI, 2025-04) | 66.3 | 90.2 | - | 91.5 | 54.6 | - | 90.2 | - | - | - | - | - | 88.4 | 52 | - | - | - | - | 1137.7 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| GPT-4o (OpenAI, 2024-05) | 53.6 | 88.7 | - | 90.2 | 38.4 | - | 76.6 | - | - | - | 69.1 | 38.2 | 86.1 | 21 | - | - | - | - | 1321.9 | - | - | - | - | - | - | - | - | 85.2 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| GPT-5 (OpenAI, 2025-08) | 88.4 | - | - | - | 74.9 | 94.6 | 97.3 | 35.2 | 18 | - | 84.2 | - | 92.5 | 72 | - | -62.7% | - | - | 1456.9 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| GPT-5.1 (OpenAI, 2025-08) | 88.1 | - | 91 | - | 76.3 | 100 | 97.8 | - | 17.6 | - | 85.4 | - | - | 75 | $1,473 | - | -2.3% | 0.8113 | 1727.6 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| GPT-5.2 (OpenAI, 2025-12) | 92.4 | - | 91 | - | 80 | 100 | 98.5 | - | 52.9 | - | 86.5 | - | - | 82 | $3,591 | - | - | 0.8093 | 1637 | 77.9 | - | 64.7 | 82 | 60.6 | 79.5 | - | 98.7 | 65.6 | - | - | - | 1462 | - | - | - | - | - | - | - | - | - | - | - |
| GPT-5.4 (OpenAI, 2026-03) | 92.8 | 91 | 91 | 85.1 | 52.8 | 100 | 88.6 | 42 | 52.9 | - | 84.2 | 56.7 | 92 | - | - | - | - | - | - | 82.7 | 75 | 75.1 | - | - | 61 | - | - | - | - | - | - | 1667 | - | +83% | - | 54.6 | 47.6 | 27.1 | 79 | - | - | +95.2% | +21.4% |
| GPT-5.4 Pro (OpenAI, 2026-03) | 84.2 | 91 | - | - | 57.7 | 100 | - | - | 52.9 | - | - | - | - | - | - | - | - | - | - | 89.3 | 75 | 55 | - | - | - | - | - | - | - | - | - | - | - | +82% | - | - | 50 | 38 | - | - | - | - | - |
| GPT-5.5 (OpenAI, 2026-04) | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 84.4 | 78.7 | 82.7 | - | - | - | - | 98 | - | 58.6 | - | - | - | - | +84.9% | 73.1 | 55.6 | 51.7 | 35.4 | 81.8 | 60 | 54.1 | - | - |
| GPT-5.5 Pro (OpenAI, 2026-04) | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 90.1 | - | - | - | - | - | - | - | - | - | - | - | - | - | +82.3% | - | - | 52.4 | 39.6 | - | - | - | - | - |
| Grok 3 (xAI, 2025-02) | 68.2 | 88.5 | - | 89.3 | 48.5 | 86.7 | 93 | - | - | - | - | - | - | 34 | - | - | - | - | 1180.6 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Grok 4 (xAI, 2025-07) | 87.5 | - | 86.6 | - | 75 | 95 | 97 | 25.4 | - | - | 76.5 | - | - | 71 | - | -45.3% | -53.4% | 0.8226 | 1131.7 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Grok 4.20 (xAI, 2026-02) | - | - | - | - | 73.5 | - | - | - | - | - | - | - | 82.9 | - | - | - | +12.1% | - | - | - | - | - | - | - | - | - | - | - | 56.2 | - | +66.6% | - | - | - | - | - | - | - | - | - | - | - | - |
| Kimi K2 Thinking (Moonshot, 2025-07) | 84.5 | - | - | - | 71.3 | 99.1 | 97 | 44.9 | - | 83.1 | - | - | - | 68 | - | - | -29.9% | 0.8068 | 1622.5 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Llama 4 Maverick (Meta, 2025-04) | 69.8 | 88.4 | - | 85.5 | - | - | 86 | - | - | - | 73.4 | - | 90 | - | - | - | - | 0.7663 | 833.2 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Llama 4 Scout (Meta, 2025-04) | 57.2 | 85.8 | - | 81.2 | - | - | 79.6 | - | - | - | 69.4 | - | 87.6 | - | - | - | - | - | 626.8 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| MiniMax M2.5 (MiniMax, 2026-01) | - | - | - | - | 63 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 48 | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| MiniMax M2.7 (MiniMax, 2026-03) | - | - | - | - | 72.5 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 57 | - | - | - | - | - | - | 56.2 | 55.6 | +66.6% | 1491 | 76.5 | - | - | - | - | - | - | - | - | - | - |
| o3 (OpenAI, 2025-04) | 83.3 | - | - | - | 69.1 | 98.4 | 96.7 | - | - | - | 82.9 | - | 91.8 | 87.5 | - | - | - | - | 1500 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| o4-mini (OpenAI, 2025-04) | 81.4 | - | - | - | 68.1 | 99.5 | 96.3 | - | - | - | 79.6 | - | 90.2 | 72 | - | - | - | - | 1210.6 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Qwen 3 (Alibaba, 2025-04) | 71.1 | 89.5 | - | 88.4 | - | 87.5 | 95 | - | - | 62.5 | - | - | 88.2 | - | - | +22.3% | -31.9% | 0.7790 | 1167.9 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |

Scores are shown in each benchmark's native units: most columns are percentages; Vending-Bench is simulated profit in USD; Alpha Arena seasons are percentage returns; Prophet Arena is 1 − Brier; EQ-Bench 3 and GDPval-AA are Elo-style ratings; GDPval is a win/tie rate and MLE-Bench Lite a medal rate.
Benchmark Descriptions
- GPQA Diamond (Reasoning): Graduate-level science questions across biology, physics, and chemistry — PhD-level difficulty
- MMLU (Knowledge): Massive Multitask Language Understanding — 57 subjects from elementary to professional level
- MMMLU (Knowledge): Multilingual MMLU across 14 languages — broad topic coverage in multiple languages
- HumanEval (Coding): Python function completion — 164 hand-written programming problems
- SWE-bench Verified (Coding): Resolving real GitHub issues — tests agentic software-engineering ability
- AIME 2024 (Math): American Invitational Mathematics Examination — competition-level math
- MATH-500 (Math): Competition-level math — 500 problems across difficulty levels
- HLE (Reasoning): Humanity's Last Exam — the hardest multi-domain benchmark, designed to push AI limits
- ARC-AGI-2 (Reasoning): Abstract reasoning and adaptability — visual puzzles that resist memorization
- LiveCodeBench (Coding): Live competitive-programming problems — real-time coding-challenge evaluation
- MMMU (Multimodal): Massive Multi-discipline Multimodal Understanding — images, diagrams, charts
- SimpleQA (Knowledge): Short-form factuality benchmark — tests factual accuracy on simple questions
- IFEval (Instruction Following): Instruction-following evaluation — measures adherence to complex instructions
- ARC-AGI-1 (Reasoning): The original ARC abstract-reasoning benchmark — visual pattern puzzles testing fluid intelligence
- Vending-Bench (Agent): Long-term coherence benchmark — the AI manages a simulated vending-machine business for a year, scored on profit
- Alpha Arena S1 (Agent): Season 1 (Oct–Nov 2025) — AI models trade $10K in crypto perpetuals on Hyperliquid, scored on % returns
- Alpha Arena S1.5 (Agent): Season 1.5 (Nov–Dec 2025) — AI models trade $10K in US stocks (TSLA, NVDA, MSFT, AMZN, QQQ), scored on % returns
- Prophet Arena (Agent): Live forecasting benchmark by SIGMA Lab (UChicago) — the AI predicts real-world events on Kalshi prediction markets, scored by Brier accuracy (1 − Brier, higher is better); a scoring sketch follows this list
- EQ-Bench 3 (EQ): Emotional-intelligence benchmark scored by Elo from pairwise LLM-judge comparisons; an Elo sketch follows this list
- BrowseComp (Agent): Agentic web search — multi-step browsing, research, and information extraction from real websites
- OSWorld (Agent): Computer-use benchmark — GUI interaction, desktop automation, real OS tasks (369 tasks)
- Terminal-Bench (Agent): Agentic terminal coding — command-line navigation, shell operations, development tasks
- τ2-bench Retail (Agent): Agentic tool use — multi-step planning and function invocation in consumer retail scenarios
- MCP Atlas (Agent): Scaled tool use — coordinating many tools simultaneously in complex agent workflows
- MMMU Pro (Multimodal): Multimodal understanding — complex visual reasoning across academic disciplines (no tools)
- APEX-Agents (Agent): Long-horizon professional agentic tasks — sustained multi-step autonomous work
- τ2-bench Telecom (Agent): Multi-step telecom customer-service tasks — tests tool use and policy compliance
- PinchBench (Agent): Agentic task success rate — 23 real-world tasks including calendar, research, email, file management, and multi-step workflows
- SWE-Pro (Coding): Real-world software engineering across multiple languages, including Python, JS, Go, and Rust — tests full-stack engineering depth
- VIBE-Pro (Coding): Repo-level code generation — end-to-end project delivery across Web, Android, iOS, and simulation tasks
- MLE-Bench Lite (Agent): Autonomous ML competition tasks (OpenAI) — the AI competes in 22 Kaggle-style ML competitions, scored by medal rate
- GDPval-AA (Agent): Professional office and domain expertise — Elo-scored evaluation of task delivery across real-world work scenarios
- SWE-Multilingual (Coding): Software engineering across multiple programming languages — tests breadth beyond Python-only benchmarks
- GDPval (Agent): Knowledge-work benchmark — rate at which the model wins or ties across well-specified real-world office tasks and occupations
- Expert-SWE (Coding): Internal frontier software-engineering eval — long-horizon coding tasks with a median human completion time of about 20 hours
- Toolathlon (Agent): Tool-use benchmark — multi-step tool calling and workflow execution across realistic tasks
- FrontierMath Tiers 1–3 (Math): Frontier math benchmark — progressively harder mathematical reasoning across Tier 1 to Tier 3 problems
- FrontierMath Tier 4 (Math): Frontier math benchmark — the hardest Tier 4 problems from FrontierMath
- CyberGym (Agent): Cybersecurity benchmark — vulnerability discovery, patching, and cyber-defense task performance
- FinanceAgent (Agent): Finance knowledge-work benchmark — complex spreadsheet, modeling, and finance task execution
- OfficeQA Pro (Knowledge): Document-heavy office benchmark — professional question answering across business documents and workflows
- USAMO 2026 (Math): Proof-based olympiad math benchmark using 2026 USAMO problems — tests long-form mathematical reasoning at elite competition level
- GraphWalks BFS (Reasoning): Long-context graph-traversal benchmark over 256K–1M-token contexts — tests structured reasoning across very long contexts; a reference BFS sketch follows this list
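Prophet Arena's 1 − Brier scoring is simple to state precisely. Below is a minimal sketch for binary markets, assuming equal weighting across events; the function name and example numbers are ours, not the benchmark's published implementation.

```python
# Minimal sketch of Brier-based accuracy scoring (1 - Brier, higher is
# better) for binary prediction-market events. Illustrative only; the
# benchmark's exact weighting and market handling may differ.

def brier_accuracy(forecasts: list[float], outcomes: list[int]) -> float:
    """Return 1 - Brier score for binary forecasts.

    forecasts: predicted probabilities in [0, 1]
    outcomes:  realized outcomes, 0 or 1
    """
    brier = sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)
    return 1.0 - brier

# A forecaster who always answers 50% scores 0.75 on binary events;
# perfectly confident, correct forecasts score 1.0.
print(brier_accuracy([0.9, 0.2, 0.7], [1, 0, 1]))  # ≈ 0.953
```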
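EQ-Bench 3 and GDPval-AA report Elo-style ratings derived from pairwise comparisons. As rough intuition for how such numbers arise, here is the standard Elo expected-score and update rule; the K-factor and tie handling are generic assumptions, and real leaderboards often fit ratings with Bradley-Terry-style maximum likelihood rather than online updates.

```python
# Standard Elo rating update from a single pairwise comparison.
# Generic illustration, not the benchmarks' actual aggregation procedure.

def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings; score_a is 1.0 (A wins), 0.5 (tie), 0.0 (B wins)."""
    e_a = elo_expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * (e_a - score_a)

# One LLM-judge comparison where a 1600-rated model beats a 1500-rated one:
print(elo_update(1600, 1500, 1.0))  # (≈1611.5, ≈1488.5)
```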
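GraphWalks-style tasks ask a model to do what a few lines of code do trivially at small scale: given an edge list embedded in the prompt, return the nodes reachable within k BFS hops of a start node. The difficulty comes from the edge list spanning hundreds of thousands of tokens, not from the algorithm. A ground-truth sketch, under our assumptions about the task format rather than the benchmark's published harness:

```python
from collections import deque

# Ground-truth BFS for a GraphWalks-style task: nodes within k hops of
# `start` in a directed graph given as an edge list.

def bfs_within_k(edges: list[tuple[str, str]], start: str, k: int) -> set[str]:
    graph: dict[str, list[str]] = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand past the hop limit
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}

print(bfs_within_k([("a", "b"), ("b", "c"), ("c", "d")], "a", 2))  # {'b', 'c'}
```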
Data sourced from official model cards, blog posts, arXiv papers, and independent evaluations.
Some scores are self-reported by labs and may use different evaluation settings.