The Sonnet That Ate Opus's Lunch
Anthropic released Claude Sonnet 4.6 on February 17, 2026, and the benchmarks tell a story that should make every model pricing team nervous: a mid-tier model priced at $3/$15 per million tokens is now matching or beating flagship models that cost 5-10x more.
Sonnet 4.6 is immediately available as the default model for Free and Pro plans on claude.ai. It features a 1 million token context window (in beta), upgrades across coding, computer use, long-context reasoning, and agent planning, plus new features like adaptive thinking, context compaction, and improved prompt injection resistance.
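For API users, a minimal request looks like the sketch below. The model identifier is an assumption based on Anthropic's usual naming pattern, and the 1M-token context beta is enabled separately via a beta flag not shown here; check the current docs for both.

```python
# Minimal sketch of calling Sonnet 4.6 via the Anthropic Python SDK.
# Assumption: the model ID follows Anthropic's usual naming pattern;
# verify the exact string (and the 1M-context beta flag) in the docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",  # assumed identifier
    max_tokens=1024,
    messages=[{"role": "user", "content": "What changed between Sonnet 4.5 and 4.6?"}],
)
print(response.content[0].text)
```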
The Benchmark Blitz
Let's get into the numbers. This is a comprehensive upgrade across every category.
Coding: Breathing Down Opus's Neck
| Benchmark | Sonnet 4.6 | Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| SWE-bench Verified | 79.6% | 80.8% | 77.0% |
| Terminal-Bench 2.0 | 59.1% | 62.7% | – |
Sonnet 4.6 is within 1.2 percentage points of Opus 4.6 on SWE-bench Verified, the gold standard for real-world software engineering, while costing roughly one-fifth as much. For most teams, the cost-performance tradeoff just tipped decisively toward Sonnet.
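To make that tradeoff concrete, here's a back-of-envelope sketch of cost per *resolved* task: per-run cost divided by resolve rate. The token counts per attempt are illustrative assumptions, not published figures.

```python
# Back-of-envelope cost per resolved SWE-bench task: run cost / resolve rate.
# Token counts per attempt are illustrative assumptions, not published data.
def cost_per_resolved(in_price, out_price, resolve_rate,
                      in_tokens=2_000_000, out_tokens=100_000):
    run_cost = in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price
    return run_cost / resolve_rate

sonnet = cost_per_resolved(3, 15, 0.796)   # $3/$15 per Mtok, 79.6% resolve rate
opus = cost_per_resolved(15, 75, 0.808)    # $15/$75 per Mtok, 80.8% resolve rate
print(f"Sonnet 4.6: ${sonnet:.2f}/solve, Opus 4.6: ${opus:.2f}/solve")
# With identical token budgets, Opus costs ~5x more per solved task
# for roughly one extra point of resolve rate.
```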
Computer Use: The Exponential Curve
This is where the story gets wild. OSWorld-Verified measures a model's ability to actually use a computer (clicking, typing, navigating applications):
| Date | Model | OSWorld Score |
|---|---|---|
| Oct 2024 | Claude 3.5 Sonnet | 14.9% |
| Feb 2025 | Claude 3.7 Sonnet | 28.0% |
| Jun 2025 | Claude 4 Sonnet | 42.2% |
| Oct 2025 | Claude Sonnet 4.5 | 61.4% |
| Feb 2026 | Claude Sonnet 4.6 | 72.5% |
That's 14.9% → 72.5% in 16 months. For comparison, GPT-5.2 scores 38.2% on the same benchmark. Opus 4.6 edges out Sonnet at 72.7%, but the gap is essentially noise.
Computer use is rapidly moving from "neat demo" to "actually useful." At 72.5%, Sonnet 4.6 can reliably navigate complex multi-step workflows across desktop applications.
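For a feel of how steep that curve is, the release-over-release multipliers fall straight out of the table above. A rough sketch:

```python
# Release-over-release growth on OSWorld-Verified, computed from the table.
history = [
    ("Claude 3.5 Sonnet", "Oct 2024", 14.9),
    ("Claude 3.7 Sonnet", "Feb 2025", 28.0),
    ("Claude 4 Sonnet",   "Jun 2025", 42.2),
    ("Claude Sonnet 4.5", "Oct 2025", 61.4),
    ("Claude Sonnet 4.6", "Feb 2026", 72.5),
]
for (_, _, prev), (model, date, score) in zip(history, history[1:]):
    print(f"{model} ({date}): {score:.1f}%  ({score / prev:.2f}x over previous)")
# Overall: 72.5 / 14.9 ≈ 4.9x in 16 months, roughly a doubling of raw score
# every ~7 months; the pace necessarily slows as scores approach 100%.
```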
Reasoning: The ARC-AGI Moonshot
| Benchmark | Sonnet 4.6 | Sonnet 4.5 | Opus 4.6 |
|---|---|---|---|
| ARC-AGI-2 | 58.3% | 13.6% | 75.2% |
| GPQA Diamond | 74.1% | 83.4% | 74.5% |
| HLE | 19.1% | 19.8% | 26.3% |
| MATH-500 | 97.8% | 96.2% | 97.6% |
The ARC-AGI-2 result demands attention: 13.6% → 58.3% is a 4.3x improvement in a single generation. ARC-AGI-2 specifically tests abstract reasoning that resists memorization; you can't benchmark-hack your way to a 4.3x jump. Something fundamental changed in how Sonnet reasons about novel patterns. (The one soft spot in the table is GPQA Diamond, where Sonnet 4.6 actually trails its predecessor.)
Agent Benchmarks: Where Sonnet Beats Everything
Here's where it gets interesting. On several real-world agent benchmarks, Sonnet 4.6 doesn't just approach Opus; it beats every model tested:
| Benchmark | Sonnet 4.6 | Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| GDPval-AA Office Elo | 1633 | 1606 | – |
| Finance Agent | 63.3% | 60.1% | – |
| MCP-Atlas (tool use) | 61.3% | 60.3% | – |
| τ²-bench Retail | 91.7% | – | – |
| τ²-bench Telecom | 97.9% | – | – |
Sonnet 4.6 is the #1 model in the world on office productivity (GDPval), financial agent tasks, and scaled tool use (MCP-Atlas). The model that costs $3 per million input tokens outperforms the model that costs $15.
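Elo numbers are opaque on their own. Assuming GDPval-AA uses the standard logistic Elo formula (an assumption on my part), a 27-point gap converts to roughly a 54% expected head-to-head win rate, real but slim:

```python
# Convert the GDPval-AA Elo gap into an expected head-to-head win rate.
# Assumption: the leaderboard uses the standard logistic Elo formula.
def win_prob(elo_a: float, elo_b: float) -> float:
    return 1 / (1 + 10 ** ((elo_b - elo_a) / 400))

print(f"{win_prob(1633, 1606):.1%}")  # Sonnet 4.6 vs Opus 4.6: ~53.9%
```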
Vending-Bench: The Business Simulator
Vending-Bench gives an AI model a simulated vending machine business and measures how much revenue it generates over a simulated year:
| Model | Revenue |
|---|---|
| Sonnet 4.5 | $2,100 |
| Sonnet 4.6 | ~$5,700 |
| Opus 4.6 | $7,400 |
Sonnet 4.6 nearly tripled its predecessor's revenue. The analysis revealed an interesting strategy: Sonnet 4.6 invested heavily in building capacity during the first 10 simulated months, then pivoted aggressively to profitability. It played the long game.
Knowledge & Understanding
| Benchmark | Sonnet 4.6 |
|---|---|
| MMLU-Pro | 79.1% |
| MATH-500 | 97.8% |
User Preferences
In head-to-head blind evaluations:
- 70% of users preferred Sonnet 4.6 over Sonnet 4.5
- 59% preferred Sonnet 4.6 over Opus 4.5 (the previous-generation flagship)
When users not only can't tell your $3 model from last quarter's $15 flagship but actually prefer the cheaper one, that's a pricing problem for the premium tier.
New Features
Beyond raw performance:
- Adaptive Thinking: Dynamically adjusts reasoning depth to query complexity; simple questions get fast answers, hard problems get deep chains of thought
- Context Compaction: Intelligently compresses long conversation histories to maintain coherence within the 1M token window without losing critical information (see the sketch after this list)
- Prompt Injection Resistance: Improved defenses against adversarial prompt injection attacks, which is critical for agent deployment where models process untrusted content
- Claude in Excel: Native integration for spreadsheet workflows (details TBD)
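Anthropic hasn't detailed how context compaction works internally, but the general pattern is easy to sketch client-side: when a conversation nears a token budget, summarize the oldest turns and keep the recent ones verbatim. Everything below (model ID, budget, prompt) is a hypothetical illustration of the idea, not Anthropic's implementation.

```python
# Illustrative client-side analogue of context compaction: summarize the
# oldest turns once the conversation nears a token budget. This is NOT how
# Anthropic's built-in feature works internally; it just sketches the idea.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-6"  # assumed identifier

def compact(messages, budget_tokens=800_000, keep_recent=10):
    """Assumes plain-string contents and that messages[-keep_recent] is a user turn."""
    used = client.messages.count_tokens(model=MODEL, messages=messages).input_tokens
    if used < budget_tokens or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{"role": "user",
                   "content": "Summarize this conversation, keeping every fact "
                              "needed to continue it:\n\n" + transcript}],
    )
    # Replace the old turns with a compact summary pair; keep the rest verbatim.
    return [{"role": "user",
             "content": "[Earlier turns, summarized]\n" + summary.content[0].text},
            {"role": "assistant",
             "content": "Understood. Continuing from that summary."}] + recent
```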
What This Means
Sonnet 4.6 continues a trend that should worry every AI lab: the mid-tier is eating the flagship. At $3/$15 per million tokens (versus $15/$75 for Opus), Sonnet 4.6 delivers 95-100% of flagship performance on coding, computer use, and agent tasks, and actually exceeds it on office productivity and finance.
The computer use trajectory is the most important chart in AI right now. Going from 14.9% to 72.5% in 16 months isn't incremental improvement; it's a capability crossing the threshold from research curiosity to production deployment. At this rate, 90%+ computer use scores by late 2026 seem plausible.
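As a sanity check on that "late 2026" guess: raw scores saturate at 100%, so a slightly more honest extrapolation fits the trend in log-odds space. Five data points make this an illustration rather than a forecast, but the fitted curve does cross 90% around the end of 2026.

```python
# Back-of-envelope extrapolation of the OSWorld trend in log-odds space
# (raw percentages can't grow linearly past 100%). Five points, so treat
# this as an illustration of the claim, not a forecast.
import math

points = [(0, 14.9), (4, 28.0), (8, 42.2), (12, 61.4), (16, 72.5)]  # months since Oct 2024

xs = [m for m, _ in points]
ys = [math.log(s / (100 - s)) for _, s in points]  # logit of each score

# Ordinary least-squares fit by hand (no numpy needed for five points).
xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
intercept = ybar - slope * xbar

for months, label in [(22, "Aug 2026"), (26, "Dec 2026")]:
    score = 100 / (1 + math.exp(-(intercept + slope * months)))
    print(f"{label}: ~{score:.0f}% if the log-odds trend holds")  # ~89% and ~94%
```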
And the ARC-AGI-2 jump, 4.3x in one generation, suggests Anthropic found something in the reasoning architecture that unlocked a step change in abstract problem-solving. That's the kind of improvement that makes the next generation very hard to predict.
Sonnet 4.6 is available now on claude.ai and through the Anthropic API.