AI Model Benchmarks
Comprehensive performance comparison across major AI labs
Last updated: February 17, 2026 · Data sourced from official model cards and benchmark publications
| Model (Lab, Release) | GPQA Diamond | MMLU | MMMLU | HumanEval | SWE-bench Verified | AIME 2024 | MATH-500 | HLE | ARC-AGI-2 | LiveCodeBench | MMMU | SimpleQA | IFEval | ARC-AGI-1 | Vending-Bench ($) | Alpha Arena S1 (% return) | Alpha Arena S1.5 (% return) | Prophet Arena (1 − Brier) | EQ-Bench 3 (Elo) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude 3.7 Sonnet (Anthropic, 2025-02) | 68 | 88 | — | 93.7 | 62.3 | — | 80 | — | — | — | — | — | 86.5 | 32 | — | — | — | — | 1083.7 |
| Claude Opus 4.1 (Anthropic, 2025-04) | 79.2 | — | 89.5 | 92 | 72.8 | — | 94.8 | — | — | — | — | — | 91.4 | 60 | — | — | — | — | 1407.9 |
| Claude Opus 4.5 (Anthropic, 2025-10) | 87 | — | 90.8 | — | 80.9 | 100 | 97.5 | 32 | 37.6 | — | 78.5 | — | 93.1 | 76 | $4,967 | — | — | 0.8036 | 1683.1 |
| Claude Opus 4.6 (Anthropic, 2026-02) | 74.5 | — | — | — | 80.8 | — | 97.6 | 26.3 | 75.2 | — | — | — | — | 85 | $8,018 | — | — | — | 1961 |
| Claude Sonnet 4.5 (Anthropic, 2025-10) | 83.4 | — | 89.1 | — | 82 | 87 | 96.2 | 19.8 | — | 64 | 75.4 | — | 92.8 | 64 | $3,839 | -30.8% | -35.1% | 0.7966 | 1501.4 |
| Claude Sonnet 4.6 (Anthropic, 2026-02) | 74.1 | 79.1 | — | — | 79.6 | — | 97.8 | 19.1 | 58.3 | — | — | — | — | — | $5,700 | — | — | — | — |
| DeepSeek V3 (DeepSeek, 2024-12) | 59.1 | 88.5 | — | 82.6 | 42 | — | 90.2 | — | — | — | — | 24.9 | 87.5 | — | — | +4.9% | -29.2% | — | 1048.7 |
| DeepSeek-R1 (DeepSeek, 2025-01) | 71.5 | 90.8 | — | — | 49.2 | 79.8 | 97.3 | — | — | 65.9 | — | 30.1 | 83.3 | 45 | — | — | — | 0.7662 | 1185.6 |
| Gemini 2.5 Flash (Google, 2025-04) | 70.5 | — | 85.1 | — | 49.6 | 83 | 91.2 | — | — | — | 68 | — | 87.5 | 42 | — | — | — | 0.8090 | 1074.2 |
| Gemini 2.5 Pro (Google, 2025-03) | 84 | — | 89.2 | — | 63.8 | 92 | 95.2 | 21.6 | — | 70.4 | 75.8 | — | 89.5 | 63 | — | -56.7% | — | 0.8121 | 1347.8 |
| Gemini 3 Flash (Google, 2025-12) | 82.1 | — | 87.5 | — | 62 | 90 | 93.5 | — | — | — | 74.2 | — | 89 | 55 | $3,635 | — | — | — | — |
| Gemini 3 Pro (Google, 2025-11) | 91.9 | — | 91.8 | — | 76.2 | 100 | 97 | 45.8 | 31.1 | 79.5 | 81 | — | 91.5 | 80 | $5,478 | — | -25.7% | 0.8104 | 1629.5 |
| GPT-4.1 (OpenAI, 2025-04) | 66.3 | 90.2 | — | 91.5 | 54.6 | — | 90.2 | — | — | — | — | — | 88.4 | 52 | — | — | — | — | 1137.7 |
| GPT-4o (OpenAI, 2024-05) | 53.6 | 88.7 | — | 90.2 | 38.4 | — | 76.6 | — | — | — | 69.1 | 38.2 | 86.1 | 21 | — | — | — | — | 1321.9 |
| GPT-5 (OpenAI, 2025-08) | 88.4 | — | — | — | 74.9 | 94.6 | 97.3 | 35.2 | 18 | — | 84.2 | — | 92.5 | 72 | — | -62.7% | — | — | 1456.9 |
| GPT-5.1 (OpenAI, 2025-08) | 88.1 | — | 91 | — | 76.3 | 100 | 97.8 | — | 17.6 | — | 85.4 | — | — | 75 | $1,473 | — | -2.3% | 0.8113 | 1727.6 |
| GPT-5.2 (OpenAI, 2025-12) | 92.4 | — | 91 | — | 80 | 100 | 98.5 | — | 52.9 | — | 86.5 | — | — | 82 | $3,591 | — | — | 0.8093 | 1637 |
| Grok 3 (xAI, 2025-02) | 68.2 | 88.5 | — | 89.3 | 48.5 | 86.7 | 93 | — | — | — | — | — | — | 34 | — | — | — | — | 1180.6 |
| Grok 4 (xAI, 2025-07) | 87.5 | — | 86.6 | — | 75 | 95 | 97 | 25.4 | — | — | 76.5 | — | — | 71 | — | -45.3% | -53.4% | 0.8226 | 1131.7 |
| Grok 4.20 (xAI, 2026-02) | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | +12.1% | — | — |
| Kimi K2 Thinking (Moonshot, 2025-07) | 84.5 | — | — | — | 71.3 | 99.1 | 97 | 44.9 | — | 83.1 | — | — | — | 68 | — | — | -29.9% | 0.8068 | 1622.5 |
| Llama 4 Maverick (Meta, 2025-04) | 69.8 | 88.4 | — | 85.5 | — | — | 86 | — | — | — | 73.4 | — | 90 | — | — | — | — | 0.7663 | 833.2 |
| Llama 4 Scout (Meta, 2025-04) | 57.2 | 85.8 | — | 81.2 | — | — | 79.6 | — | — | — | 69.4 | — | 87.6 | — | — | — | — | — | 626.8 |
| o3 (OpenAI, 2025-04) | 83.3 | — | — | — | 69.1 | 98.4 | 96.7 | — | — | — | 82.9 | — | 91.8 | 87.5 | — | — | — | — | 1500 |
| o4-mini (OpenAI, 2025-04) | 81.4 | — | — | — | 68.1 | 99.5 | 96.3 | — | — | — | 79.6 | — | 90.2 | 72 | — | — | — | — | 1210.6 |
| Qwen 3 (Alibaba, 2025-04) | 71.1 | 89.5 | — | 88.4 | — | 87.5 | 95 | — | — | 62.5 | — | — | 88.2 | — | — | +22.3% | -31.9% | 0.7790 | 1167.9 |
Benchmark Descriptions
GPQA Diamond (Reasoning): Graduate-level science questions across biology, physics, and chemistry — PhD-level difficulty
MMLU (Knowledge): Massive Multitask Language Understanding — 57 subjects from elementary to professional level
MMMLU (Knowledge): Multilingual MMLU across 14 languages — broad topic coverage in multiple languages
HumanEval (Coding): Python function completion — 164 hand-written programming problems
SWE-bench Verified (Coding): Resolving real GitHub issues — tests agentic software engineering ability
AIME 2024 (Math): American Invitational Mathematics Examination — competition-level math
MATH-500 (Math): Competition-level math problems — 500 problems across difficulty levels
HLE (Reasoning): Humanity's Last Exam — a very hard multi-domain benchmark designed to push AI limits
ARC-AGI-2 (Reasoning): Abstract reasoning and adaptability — visual puzzles that resist memorization
LiveCodeBench (Coding): Live competitive programming problems — real-time coding challenge evaluation
MMMU (Multimodal): Massive Multi-discipline Multimodal Understanding — images, diagrams, and charts
SimpleQA (Knowledge): Short-form factuality benchmark — tests factual accuracy on simple questions
IFEval (Instruction Following): Instruction-following evaluation — measures adherence to complex instructions
ARC-AGI-1 (Reasoning): The original ARC abstract reasoning benchmark — visual pattern puzzles testing fluid intelligence
Vending-Bench (Agent): Long-term coherence benchmark — the AI manages a simulated vending machine business for a year, scored on profit
Alpha Arena S1 (Agent): Season 1 (Oct–Nov 2025) — AI models trade $10K in crypto perpetuals on Hyperliquid, scored on % returns (see the return-calculation sketch after this list)
Alpha Arena S1.5 (Agent): Season 1.5 (Nov–Dec 2025) — AI models trade $10K in US stocks (TSLA, NVDA, MSFT, AMZN, QQQ), scored on % returns
Prophet Arena (Agent): Live forecasting benchmark by SIGMA Lab (UChicago) — the AI predicts real-world events on Kalshi prediction markets, scored by Brier accuracy (1 − Brier, higher is better; see the scoring sketch after this list)
EQ-Bench 3 (EQ): Emotional intelligence benchmark scored by Elo ratings derived from pairwise LLM-judge comparisons (see the Elo sketch after this list)
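
For context on how the Alpha Arena columns are computed, here is a minimal sketch of the percent-return scoring described above, assuming a fixed $10K starting bankroll. The model names and final balances are hypothetical, not leaderboard data.

```python
# Minimal sketch of the Alpha Arena scoring described above: each model starts
# with a $10K bankroll and is ranked by percent return on that starting capital.
# The final balances below are invented for illustration, not real results.

STARTING_CAPITAL = 10_000.00  # USD, per the benchmark description

def percent_return(final_balance: float, start: float = STARTING_CAPITAL) -> float:
    """Percent gain or loss relative to the starting bankroll."""
    return (final_balance - start) / start * 100.0

if __name__ == "__main__":
    # Hypothetical end-of-season balances.
    finals = {"model_a": 12_230.0, "model_b": 6_920.0}
    for name, balance in finals.items():
        print(f"{name}: {percent_return(balance):+.1f}%")  # +22.3% / -30.8%
```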
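The Prophet Arena column reports 1 − Brier, so higher is better. Below is a minimal sketch of that scoring rule for binary markets; the probabilities and outcomes are made up for illustration, not actual Prophet Arena data.

```python
# Illustrative sketch of the Prophet Arena scoring rule described above: the model
# assigns a probability to each binary market outcome, the Brier score is the mean
# squared error against what actually happened, and the leaderboard reports
# 1 - Brier (higher is better). All values here are invented.

def brier_accuracy(forecasts, outcomes):
    """Return 1 - mean Brier score for binary-event forecasts.

    forecasts: predicted probabilities that each event resolves YES (0..1)
    outcomes:  1 if the event resolved YES, 0 otherwise
    """
    assert len(forecasts) == len(outcomes)
    brier = sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)
    return 1.0 - brier

if __name__ == "__main__":
    # Hypothetical predictions on four binary prediction markets.
    probs = [0.80, 0.30, 0.55, 0.95]
    truth = [1, 0, 1, 1]
    print(f"Brier accuracy: {brier_accuracy(probs, truth):.4f}")  # ~0.9163
```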
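EQ-Bench 3 scores are Elo ratings fit from pairwise judge verdicts. The sketch below shows a standard sequential Elo update as an illustration only; the K-factor, starting rating, and verdicts are assumptions, not EQ-Bench's actual fitting procedure.

```python
# Minimal sketch of deriving Elo ratings from pairwise LLM-judge comparisons,
# as in the EQ-Bench 3 column above. K-factor, starting rating, and match
# results are assumptions for illustration.

from collections import defaultdict

K = 32          # update step size (assumed)
START = 1000.0  # initial rating for every model (assumed)

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def run_elo(matches):
    """matches: list of (winner, loser) pairs from an LLM judge."""
    ratings = defaultdict(lambda: START)
    for winner, loser in matches:
        e_w = expected(ratings[winner], ratings[loser])
        ratings[winner] += K * (1.0 - e_w)
        ratings[loser] -= K * (1.0 - e_w)
    return dict(ratings)

if __name__ == "__main__":
    # Hypothetical judge verdicts between two models.
    verdicts = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
    for model, rating in sorted(run_elo(verdicts).items(), key=lambda x: -x[1]):
        print(f"{model}: {rating:.1f}")
```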
Data sourced from official model cards, blog posts, arXiv papers, and independent evaluations.
Some scores are self-reported by labs and may use different evaluation settings.