AI Model Benchmarks
Comprehensive performance comparison across major AI labs
Last updated: April 24, 2026 · Data sourced from official model cards and benchmark publications
| Model (Lab, Release) | GPQA Diamond | MMLU | MMMLU | HumanEval | SWE-bench Verified | AIME 2024 | MATH-500 | HLE | ARC-AGI-2 | LiveCodeBench | MMMU | SimpleQA | IFEval | ARC-AGI-1 | Vending-Bench | Alpha Arena S1 | Alpha Arena S1.5 | Prophet Arena | EQ-Bench 3 | BrowseComp | OSWorld | Terminal-Bench | τ2-bench Retail | MCP Atlas | MMMU Pro | APEX-Agents | τ2-bench Telecom | PinchBench | SWE-Pro | VIBE-Pro | MLE-Bench Lite | GDPval-AA | SWE-Multilingual | GDPval | Expert-SWE | Toolathlon | FrontierMath Tiers 1–3 | FrontierMath Tier 4 | CyberGym | FinanceAgent | OfficeQA Pro | USAMO 2026 | GraphWalks BFS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude 3.7 Sonnet (Anthropic, 2025-02) | 68 | 88 | - | 93.7 | 62.3 | - | 80 | - | - | - | - | - | 86.5 | 32 | - | - | - | - | 1083.7 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Claude Mythos Preview (Anthropic, 2026-04) | 94.5 | - | 92.7 | - | 93.9 | - | - | 56.8 | - | - | - | - | - | - | - | - | - | - | - | - | 79.6 | 82 | - | - | - | - | - | - | 77.8 | - | - | - | 87.3 | - | - | - | - | - | - | - | - | +97.6% | +80% |
| Claude Opus 4.1 (Anthropic, 2025-04) | 79.2 | - | 89.5 | 92 | 72.8 | - | 94.8 | - | - | - | - | - | 91.4 | 60 | - | - | - | - | 1407.9 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Claude Opus 4.5 (Anthropic, 2025-10) | 87 | - | 90.8 | - | 80.9 | 100 | 97.5 | 32 | 37.6 | - | 78.5 | - | 93.1 | 76 | $4,967 | - | - | 0.8036 | 1683.1 | 67.8 | 66.3 | 59.8 | 88.9 | 62.3 | 70.6 | - | 98.2 | 88.9 | - | - | - | 1416 | - | - | - | - | - | - | - | - | - | - | - |
| Claude Opus 4.6 (Anthropic, 2026-02) | 91.3 | - | - | - | 80.8 | - | 97.6 | 40 | 75.2 | - | - | - | - | 85 | $8,018 | - | - | - | 1961 | 84 | 72.7 | 65.4 | 91.9 | 59.5 | 73.9 | 33.5 | 99.3 | 90.6 | - | - | - | 1606 | - | - | - | - | - | - | - | - | - | +42.3% | +38.7% |
| Claude Opus 4.7 (Anthropic, 2026-04) | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 79.3 | 78 | 69.4 | - | - | - | - | - | - | - | - | - | - | - | +80.3% | - | - | 43.8 | 22.9 | 73.1 | - | - | - | - |
| Claude Sonnet 4.5 (Anthropic, 2025-10) | 83.4 | - | 89.1 | - | 82 | 87 | 96.2 | 19.8 | - | 64 | 75.4 | - | 92.8 | 64 | $3,839 | -30.8% | -35.1% | 0.7966 | 1501.4 | 43.9 | 61.4 | 51 | 86.2 | 43.8 | 63.4 | - | - | 92.7 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Claude Sonnet 4.6 (Anthropic, 2026-02) | 74.1 | 79.1 | - | - | 79.6 | - | 97.8 | 19.1 | 58.3 | - | - | - | - | - | $5,700 | - | - | - | - | 74.7 | - | 59.1 | - | - | - | - | - | - | - | - | - | 1553 | - | - | - | - | - | - | - | - | - | - | - |
| DeepSeek V3 (DeepSeek, 2024-12) | 59.1 | 88.5 | - | 82.6 | 42 | - | 90.2 | - | - | - | - | 24.9 | 87.5 | - | - | +4.9% | -29.2% | - | 1048.7 | - | - | - | - | - | - | - | - | 82.1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| DeepSeek-R1 (DeepSeek, 2025-01) | 71.5 | 90.8 | - | - | 49.2 | 79.8 | 97.3 | - | - | 65.9 | - | 30.1 | 83.3 | 45 | - | - | - | 0.7662 | 1185.6 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Gemini 2.5 Flash (Google, 2025-04) | 70.5 | - | 85.1 | - | 49.6 | 83 | 91.2 | - | - | - | 68 | - | 87.5 | 42 | - | - | - | 0.8090 | 1074.2 | - | - | - | - | - | - | - | - | 76.6 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Gemini 2.5 Pro (Google, 2025-03) | 84 | - | 89.2 | - | 63.8 | 92 | 95.2 | 21.6 | - | 70.4 | 75.8 | - | 89.5 | 63 | - | -56.7% | - | 0.8121 | 1347.8 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Gemini 3 Flash (Google, 2025-12) | 82.1 | - | 87.5 | - | 62 | 90 | 93.5 | - | - | - | 74.2 | - | 89 | 55 | $3,635 | - | - | - | - | - | - | - | - | - | - | - | - | 95.1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Gemini 3 Pro (Google, 2025-11) | 91.9 | - | 91.8 | - | 76.2 | 100 | 97 | 45.8 | 31.1 | 79.5 | 81 | - | 91.5 | 80 | $5,478 | - | -25.7% | 0.8104 | 1629.5 | 59.2 | - | 56.2 | 85.3 | 54.1 | 81 | 18.4 | - | 91.7 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Gemini 3.1 Pro (Google, 2026-02) | 94.3 | - | 93.1 | - | 80.6 | - | - | 44.4 | 77.1 | - | - | - | - | - | - | - | - | - | - | 85.9 | - | 68.5 | - | - | 80.5 | 33.5 | - | - | 54.2 | - | - | - | - | +67.3% | - | 48.8 | 36.9 | 16.7 | - | - | - | +74.4% | - |
| Gemini Deep Think (Google, 2026-02) | 93.8 | - | - | - | - | - | - | 48.4 | 84.6 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| GPT-4.1 (OpenAI, 2025-04) | 66.3 | 90.2 | - | 91.5 | 54.6 | - | 90.2 | - | - | - | - | - | 88.4 | 52 | - | - | - | - | 1137.7 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| GPT-4o (OpenAI, 2024-05) | 53.6 | 88.7 | - | 90.2 | 38.4 | - | 76.6 | - | - | - | 69.1 | 38.2 | 86.1 | 21 | - | - | - | - | 1321.9 | - | - | - | - | - | - | - | - | 85.2 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| GPT-5 (OpenAI, 2025-08) | 88.4 | - | - | - | 74.9 | 94.6 | 97.3 | 35.2 | 18 | - | 84.2 | - | 92.5 | 72 | - | -62.7% | - | - | 1456.9 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| GPT-5.1 (OpenAI, 2025-08) | 88.1 | - | 91 | - | 76.3 | 100 | 97.8 | - | 17.6 | - | 85.4 | - | - | 75 | $1,473 | - | -2.3% | 0.8113 | 1727.6 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| GPT-5.2 (OpenAI, 2025-12) | 92.4 | - | 91 | - | 80 | 100 | 98.5 | - | 52.9 | - | 86.5 | - | - | 82 | $3,591 | - | - | 0.8093 | 1637 | 77.9 | - | 64.7 | 82 | 60.6 | 79.5 | - | 98.7 | 65.6 | - | - | - | 1462 | - | - | - | - | - | - | - | - | - | - | - |
| GPT-5.4 (OpenAI, 2026-03) | 92.8 | 91 | 91 | 85.1 | 52.8 | 100 | 88.6 | 42 | 52.9 | - | 84.2 | 56.7 | 92 | - | - | - | - | - | - | 82.7 | 75 | 75.1 | - | - | 61 | - | - | - | - | - | - | 1667 | - | +83% | - | 54.6 | 47.6 | 27.1 | 79 | - | - | +95.2% | +21.4% |
| GPT-5.4 Pro (OpenAI, 2026-03) | 84.2 | 91 | - | - | 57.7 | 100 | - | - | 52.9 | - | - | - | - | - | - | - | - | - | - | 89.3 | 75 | 55 | - | - | - | - | - | - | - | - | - | - | - | +82% | - | - | 50 | 38 | - | - | - | - | - |
| GPT-5.5 (OpenAI, 2026-04) | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 84.4 | 78.7 | 82.7 | - | - | - | - | 98 | - | 58.6 | - | - | - | - | +84.9% | 73.1 | 55.6 | 51.7 | 35.4 | 81.8 | 60 | 54.1 | - | - |
| GPT-5.5 Pro (OpenAI, 2026-04) | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 90.1 | - | - | - | - | - | - | - | - | - | - | - | - | - | +82.3% | - | - | 52.4 | 39.6 | - | - | - | - | - |
| Grok 3 (xAI, 2025-02) | 68.2 | 88.5 | - | 89.3 | 48.5 | 86.7 | 93 | - | - | - | - | - | - | 34 | - | - | - | - | 1180.6 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Grok 4 (xAI, 2025-07) | 87.5 | - | 86.6 | - | 75 | 95 | 97 | 25.4 | - | - | 76.5 | - | - | 71 | - | -45.3% | -53.4% | 0.8226 | 1131.7 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Grok 4.20 (xAI, 2026-02) | - | - | - | - | 73.5 | - | - | - | - | - | - | - | 82.9 | - | - | - | +12.1% | - | - | - | - | - | - | - | - | - | - | - | 56.2 | - | +66.6% | - | - | - | - | - | - | - | - | - | - | - | - |
| Kimi K2 Thinking (Moonshot, 2025-07) | 84.5 | - | - | - | 71.3 | 99.1 | 97 | 44.9 | - | 83.1 | - | - | - | 68 | - | - | -29.9% | 0.8068 | 1622.5 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Llama 4 Maverick (Meta, 2025-04) | 69.8 | 88.4 | - | 85.5 | - | - | 86 | - | - | - | 73.4 | - | 90 | - | - | - | - | 0.7663 | 833.2 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Llama 4 Scout (Meta, 2025-04) | 57.2 | 85.8 | - | 81.2 | - | - | 79.6 | - | - | - | 69.4 | - | 87.6 | - | - | - | - | - | 626.8 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| MiniMax M2.5 (MiniMax, 2026-01) | - | - | - | - | 63 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 48 | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| MiniMax M2.7 (MiniMax, 2026-03) | - | - | - | - | 72.5 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 57 | - | - | - | - | - | - | 56.2 | 55.6 | +66.6% | 1491 | 76.5 | - | - | - | - | - | - | - | - | - | - |
| o3 (OpenAI, 2025-04) | 83.3 | - | - | - | 69.1 | 98.4 | 96.7 | - | - | - | 82.9 | - | 91.8 | 87.5 | - | - | - | - | 1500 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| o4-mini (OpenAI, 2025-04) | 81.4 | - | - | - | 68.1 | 99.5 | 96.3 | - | - | - | 79.6 | - | 90.2 | 72 | - | - | - | - | 1210.6 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Qwen 3 (Alibaba, 2025-04) | 71.1 | 89.5 | - | 88.4 | - | 87.5 | 95 | - | - | 62.5 | - | - | 88.2 | - | - | +22.3% | -31.9% | 0.7790 | 1167.9 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |

Scores are shown in each benchmark's native units: most columns are percentages; Vending-Bench is simulated profit in USD; Alpha Arena seasons are percentage returns; Prophet Arena is 1 − Brier; EQ-Bench 3 and GDPval-AA are Elo-style ratings; GDPval is a win/tie rate and MLE-Bench Lite a medal rate.
Benchmark Descriptions
- GPQA Diamond (Reasoning): Graduate-level science questions across biology, physics, and chemistry — PhD-level difficulty
- MMLU (Knowledge): Massive Multitask Language Understanding — 57 subjects from elementary to professional level
- MMMLU (Knowledge): Multilingual MMLU across 14 languages — broad topic coverage in multiple languages
- HumanEval (Coding): Python function completion — 164 hand-written programming problems
- SWE-bench Verified (Coding): Resolving real GitHub issues — tests agentic software-engineering ability
- AIME 2024 (Math): American Invitational Mathematics Examination — competition-level math
- MATH-500 (Math): Competition-level math — 500 problems across difficulty levels
- HLE (Reasoning): Humanity's Last Exam — the hardest multi-domain benchmark, designed to push AI limits
- ARC-AGI-2 (Reasoning): Abstract reasoning and adaptability — visual puzzles that resist memorization
- LiveCodeBench (Coding): Live competitive-programming problems — real-time coding-challenge evaluation
- MMMU (Multimodal): Massive Multi-discipline Multimodal Understanding — images, diagrams, charts
- SimpleQA (Knowledge): Short-form factuality benchmark — tests factual accuracy on simple questions
- IFEval (Instruction Following): Instruction-following evaluation — measures adherence to complex instructions
- ARC-AGI-1 (Reasoning): The original ARC abstract-reasoning benchmark — visual pattern puzzles testing fluid intelligence
- Vending-Bench (Agent): Long-term coherence benchmark — the AI manages a simulated vending-machine business for a year, scored on profit
- Alpha Arena S1 (Agent): Season 1 (Oct–Nov 2025) — AI models trade $10K in crypto perpetuals on Hyperliquid, scored on % returns
- Alpha Arena S1.5 (Agent): Season 1.5 (Nov–Dec 2025) — AI models trade $10K in US stocks (TSLA, NVDA, MSFT, AMZN, QQQ), scored on % returns
- Prophet Arena (Agent): Live forecasting benchmark by SIGMA Lab (UChicago) — the AI predicts real-world events on Kalshi prediction markets, scored by Brier accuracy (1 − Brier, higher is better); a scoring sketch follows this list
- EQ-Bench 3 (EQ): Emotional-intelligence benchmark scored by Elo from pairwise LLM-judge comparisons; an Elo sketch follows this list
- BrowseComp (Agent): Agentic web search — multi-step browsing, research, and information extraction from real websites
- OSWorld (Agent): Computer-use benchmark — GUI interaction, desktop automation, real OS tasks (369 tasks)
- Terminal-Bench (Agent): Agentic terminal coding — command-line navigation, shell operations, development tasks
- τ2-bench Retail (Agent): Agentic tool use — multi-step planning and function invocation in consumer retail scenarios
- MCP Atlas (Agent): Scaled tool use — coordinating many tools simultaneously in complex agent workflows
- MMMU Pro (Multimodal): Multimodal understanding — complex visual reasoning across academic disciplines (no tools)
- APEX-Agents (Agent): Long-horizon professional agentic tasks — sustained multi-step autonomous work
- τ2-bench Telecom (Agent): Multi-step telecom customer-service tasks — tests tool use and policy compliance
- PinchBench (Agent): Agentic task success rate — 23 real-world tasks including calendar, research, email, file management, and multi-step workflows
- SWE-Pro (Coding): Real-world software engineering across multiple languages, including Python, JS, Go, and Rust — tests full-stack engineering depth
- VIBE-Pro (Coding): Repo-level code generation — end-to-end project delivery across Web, Android, iOS, and simulation tasks
- MLE-Bench Lite (Agent): Autonomous ML competition tasks (OpenAI) — the AI competes in 22 Kaggle-style ML competitions, scored by medal rate
- GDPval-AA (Agent): Professional office and domain expertise — Elo-scored evaluation of task delivery across real-world work scenarios
- SWE-Multilingual (Coding): Software engineering across multiple programming languages — tests breadth beyond Python-only benchmarks
- GDPval (Agent): Knowledge-work benchmark — rate at which the model wins or ties across well-specified real-world office tasks and occupations
- Expert-SWE (Coding): Internal frontier software-engineering eval — long-horizon coding tasks with a median human completion time of about 20 hours
- Toolathlon (Agent): Tool-use benchmark — multi-step tool calling and workflow execution across realistic tasks
- FrontierMath Tiers 1–3 (Math): Frontier math benchmark — progressively harder mathematical reasoning across Tier 1 to Tier 3 problems
- FrontierMath Tier 4 (Math): Frontier math benchmark — the hardest Tier 4 problems from FrontierMath
- CyberGym (Agent): Cybersecurity benchmark — vulnerability discovery, patching, and cyber-defense task performance
- FinanceAgent (Agent): Finance knowledge-work benchmark — complex spreadsheet, modeling, and finance task execution
- OfficeQA Pro (Knowledge): Document-heavy office benchmark — professional question answering across business documents and workflows
- USAMO 2026 (Math): Proof-based olympiad math benchmark using 2026 USAMO problems — tests long-form mathematical reasoning at elite competition level
- GraphWalks BFS (Reasoning): Long-context graph-traversal benchmark over 256K–1M-token contexts — tests structured reasoning across very long contexts; a reference BFS sketch follows this list
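Prophet Arena's 1 − Brier scoring is simple to state precisely. Below is a minimal sketch for binary markets, assuming equal weighting across events; the function name and example numbers are ours, not the benchmark's published implementation.

```python
# Minimal sketch of Brier-based accuracy scoring (1 - Brier, higher is
# better) for binary prediction-market events. Illustrative only; the
# benchmark's exact weighting and market handling may differ.

def brier_accuracy(forecasts: list[float], outcomes: list[int]) -> float:
    """Return 1 - Brier score for binary forecasts.

    forecasts: predicted probabilities in [0, 1]
    outcomes:  realized outcomes, 0 or 1
    """
    brier = sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)
    return 1.0 - brier

# A forecaster who always answers 50% scores 0.75 on binary events;
# perfectly confident, correct forecasts score 1.0.
print(brier_accuracy([0.9, 0.2, 0.7], [1, 0, 1]))  # ≈ 0.953
```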
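EQ-Bench 3 and GDPval-AA report Elo-style ratings derived from pairwise comparisons. As rough intuition for how such numbers arise, here is the standard Elo expected-score and update rule; the K-factor and tie handling are generic assumptions, and real leaderboards often fit ratings with Bradley-Terry-style maximum likelihood rather than online updates.

```python
# Standard Elo rating update from a single pairwise comparison.
# Generic illustration, not the benchmarks' actual aggregation procedure.

def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Update both ratings; score_a is 1.0 (A wins), 0.5 (tie), 0.0 (B wins)."""
    e_a = elo_expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * (e_a - score_a)

# One LLM-judge comparison where a 1600-rated model beats a 1500-rated one:
print(elo_update(1600, 1500, 1.0))  # (≈1611.5, ≈1488.5)
```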
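GraphWalks-style tasks ask a model to do what a few lines of code do trivially at small scale: given an edge list embedded in the prompt, return the nodes reachable within k BFS hops of a start node. The difficulty comes from the edge list spanning hundreds of thousands of tokens, not from the algorithm. A ground-truth sketch, under our assumptions about the task format rather than the benchmark's published harness:

```python
from collections import deque

# Ground-truth BFS for a GraphWalks-style task: nodes within k hops of
# `start` in a directed graph given as an edge list.

def bfs_within_k(edges: list[tuple[str, str]], start: str, k: int) -> set[str]:
    graph: dict[str, list[str]] = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand past the hop limit
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {start}

print(bfs_within_k([("a", "b"), ("b", "c"), ("c", "d")], "a", 2))  # {'b', 'c'}
```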
Data sourced from official model cards, blog posts, arXiv papers, and independent evaluations.
Some scores are self-reported by labs and may use different evaluation settings.