The Sonnet That Ate Opus's Lunch
Anthropic released Claude Sonnet 4.6 on February 17, 2026, and the benchmarks tell a story that should make every model pricing team nervous: a mid-tier model priced at $3/$15 per million tokens is now matching or beating flagship models that cost 5-10x more.
Sonnet 4.6 is immediately available as the default model for Free and Pro plans on claude.ai. It features a 1 million token context window (in beta), upgrades across coding, computer use, long-context reasoning, and agent planning, plus new features like adaptive thinking, context compaction, and improved prompt injection resistance.
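For API users, a minimal request looks like the sketch below. The model identifier is an assumption based on Anthropic's usual naming pattern, and the 1M-token context beta is enabled separately via a beta flag not shown here; check the current docs for both.

```python
# Minimal sketch of calling Sonnet 4.6 via the Anthropic Python SDK.
# Assumption: the model ID follows Anthropic's usual naming pattern;
# verify the exact string (and the 1M-context beta flag) in the docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",  # assumed identifier
    max_tokens=1024,
    messages=[{"role": "user", "content": "What changed between Sonnet 4.5 and 4.6?"}],
)
print(response.content[0].text)
```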
The Benchmark Blitz
Let's get into the numbers. This is a comprehensive upgrade across every category.
Coding: Breathing Down Opus's Neck
| Benchmark | Sonnet 4.6 | Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| SWE-bench Verified | 79.6% | 80.8% | 77.0% |
| Terminal-Bench 2.0 | 59.1% | 62.7% | – |
Sonnet 4.6 is within 1.2 percentage points of Opus 4.6 on SWE-bench Verified, the gold standard for real-world software engineering, while costing roughly one-fifth as much. For most teams, the cost-performance tradeoff just tipped decisively toward Sonnet.
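To make that tradeoff concrete, here's a back-of-envelope sketch of cost per *resolved* task: per-run cost divided by resolve rate. The token counts per attempt are illustrative assumptions, not published figures.

```python
# Back-of-envelope cost per resolved SWE-bench task: run cost / resolve rate.
# Token counts per attempt are illustrative assumptions, not published data.
def cost_per_resolved(in_price, out_price, resolve_rate,
                      in_tokens=2_000_000, out_tokens=100_000):
    run_cost = in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price
    return run_cost / resolve_rate

sonnet = cost_per_resolved(3, 15, 0.796)   # $3/$15 per Mtok, 79.6% resolve rate
opus = cost_per_resolved(15, 75, 0.808)    # $15/$75 per Mtok, 80.8% resolve rate
print(f"Sonnet 4.6: ${sonnet:.2f}/solve, Opus 4.6: ${opus:.2f}/solve")
# With identical token budgets, Opus costs ~5x more per solved task
# for roughly one extra point of resolve rate.
```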
Computer Use: The Exponential Curve
This is where the story gets wild. OSWorld-Verified measures a model's ability to actually use a computer (clicking, typing, navigating applications):
| Date | Model | OSWorld Score |
|---|---|---|
| Oct 2024 | Claude 3.5 Sonnet | 14.9% |
| Feb 2025 | Claude 3.7 Sonnet | 28.0% |
| Jun 2025 | Claude 4 Sonnet | 42.2% |
| Oct 2025 | Claude Sonnet 4.5 | 61.4% |
| Feb 2026 | Claude Sonnet 4.6 | 72.5% |
That's 14.9% → 72.5% in 16 months. For comparison, GPT-5.2 scores 38.2% on the same benchmark. Opus 4.6 edges out Sonnet at 72.7%, but the gap is essentially noise.
Computer use is rapidly moving from "neat demo" to "actually useful." At 72.5%, Sonnet 4.6 can reliably navigate complex multi-step workflows across desktop applications.
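For a feel of how steep that curve is, the release-over-release multipliers fall straight out of the table above. A rough sketch:

```python
# Release-over-release growth on OSWorld-Verified, computed from the table.
history = [
    ("Claude 3.5 Sonnet", "Oct 2024", 14.9),
    ("Claude 3.7 Sonnet", "Feb 2025", 28.0),
    ("Claude 4 Sonnet",   "Jun 2025", 42.2),
    ("Claude Sonnet 4.5", "Oct 2025", 61.4),
    ("Claude Sonnet 4.6", "Feb 2026", 72.5),
]
for (_, _, prev), (model, date, score) in zip(history, history[1:]):
    print(f"{model} ({date}): {score:.1f}%  ({score / prev:.2f}x over previous)")
# Overall: 72.5 / 14.9 ≈ 4.9x in 16 months, roughly a doubling of raw score
# every ~7 months; the pace necessarily slows as scores approach 100%.
```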
Reasoning: The ARC-AGI Moonshot
| Benchmark | Sonnet 4.6 | Sonnet 4.5 | Opus 4.6 |
|---|---|---|---|
| ARC-AGI-2 | 58.3% | 13.6% | 75.2% |
| GPQA Diamond | 74.1% | 83.4% | 74.5% |
| HLE | 19.1% | 19.8% | 26.3% |
| MATH-500 | 97.8% | 96.2% | 97.6% |
The ARC-AGI-2 result demands attention: 13.6% → 58.3% is a 4.3x improvement in a single generation. ARC-AGI-2 specifically tests abstract reasoning that resists memorization; you can't benchmark-hack your way to a 4.3x jump. Something fundamental changed in how Sonnet reasons about novel patterns. (The one soft spot in the table is GPQA Diamond, where Sonnet 4.6 actually trails its predecessor.)
Agent Benchmarks: Where Sonnet Beats Everything
Here's where it gets interesting. On several real-world agent benchmarks, Sonnet 4.6 doesn't just approach Opus; it beats every model tested:
| Benchmark | Sonnet 4.6 | Opus 4.6 | GPT-5.2 |
|---|---|---|---|
| GDPval-AA Office Elo | 1633 | 1606 | – |
| Finance Agent | 63.3% | 60.1% | – |
| MCP-Atlas (tool use) | 61.3% | 60.3% | – |
| τ²-bench Retail | 91.7% | – | – |
| τ²-bench Telecom | 97.9% | – | – |
Sonnet 4.6 is the #1 model in the world on office productivity (GDPval), financial agent tasks, and scaled tool use (MCP-Atlas). The model that costs $3 per million input tokens outperforms the model that costs $15.
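Elo numbers are opaque on their own. Assuming GDPval-AA uses the standard logistic Elo formula (an assumption on my part), a 27-point gap converts to roughly a 54% expected head-to-head win rate, real but slim:

```python
# Convert the GDPval-AA Elo gap into an expected head-to-head win rate.
# Assumption: the leaderboard uses the standard logistic Elo formula.
def win_prob(elo_a: float, elo_b: float) -> float:
    return 1 / (1 + 10 ** ((elo_b - elo_a) / 400))

print(f"{win_prob(1633, 1606):.1%}")  # Sonnet 4.6 vs Opus 4.6: ~53.9%
```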
Vending-Bench: The Business Simulator
Vending-Bench gives an AI model a simulated vending machine business and measures how much revenue it generates over a simulated year:
| Model | Revenue |
|---|---|
| Sonnet 4.5 | $2,100 |
| Sonnet 4.6 | ~$5,700 |
| Opus 4.6 | $7,400 |
Sonnet 4.6 nearly tripled its predecessor's revenue. The analysis revealed an interesting strategy: Sonnet 4.6 invested heavily in building capacity during the first 10 simulated months, then pivoted aggressively to profitability. It played the long game.
Knowledge & Understanding
| Benchmark | Sonnet 4.6 |
|---|---|
| MMLU-Pro | 79.1% |
| MATH-500 | 97.8% |
User Preferences
In head-to-head blind evaluations:
- 70% of users preferred Sonnet 4.6 over Sonnet 4.5
- 59% preferred Sonnet 4.6 over Opus 4.5 (the previous-generation flagship)
When users not only can't tell your $3 model from last quarter's $15 flagship but actually prefer the cheaper one, that's a pricing problem for the premium tier.
New Features
Beyond raw performance:
- Adaptive Thinking: Dynamically adjusts reasoning depth to query complexity; simple questions get fast answers, hard problems get deep chains of thought
- Context Compaction: Intelligently compresses long conversation histories to maintain coherence within the 1M token window without losing critical information (see the sketch after this list)
- Prompt Injection Resistance: Improved defenses against adversarial prompt injection attacks, which is critical for agent deployment where models process untrusted content
- Claude in Excel: Native integration for spreadsheet workflows (details TBD)
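Anthropic hasn't detailed how context compaction works internally, but the general pattern is easy to sketch client-side: when a conversation nears a token budget, summarize the oldest turns and keep the recent ones verbatim. Everything below (model ID, budget, prompt) is a hypothetical illustration of the idea, not Anthropic's implementation.

```python
# Illustrative client-side analogue of context compaction: summarize the
# oldest turns once the conversation nears a token budget. This is NOT how
# Anthropic's built-in feature works internally; it just sketches the idea.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-6"  # assumed identifier

def compact(messages, budget_tokens=800_000, keep_recent=10):
    """Assumes plain-string contents and that messages[-keep_recent] is a user turn."""
    used = client.messages.count_tokens(model=MODEL, messages=messages).input_tokens
    if used < budget_tokens or len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{"role": "user",
                   "content": "Summarize this conversation, keeping every fact "
                              "needed to continue it:\n\n" + transcript}],
    )
    # Replace the old turns with a compact summary pair; keep the rest verbatim.
    return [{"role": "user",
             "content": "[Earlier turns, summarized]\n" + summary.content[0].text},
            {"role": "assistant",
             "content": "Understood. Continuing from that summary."}] + recent
```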
What This Means
Sonnet 4.6 continues a trend that should worry every AI lab: the mid-tier is eating the flagship. At $3/$15 per million tokens (versus $15/$75 for Opus), Sonnet 4.6 delivers 95-100% of flagship performance on coding, computer use, and agent tasks, and actually exceeds it on office productivity and finance.
The computer use trajectory is the most important chart in AI right now. Going from 14.9% to 72.5% in 16 months isn't incremental improvement; it's a capability crossing the threshold from research curiosity to production deployment. At this rate, 90%+ computer use scores by late 2026 seem plausible.
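As a sanity check on that "late 2026" guess: raw scores saturate at 100%, so a slightly more honest extrapolation fits the trend in log-odds space. Five data points make this an illustration rather than a forecast, but the fitted curve does cross 90% around the end of 2026.

```python
# Back-of-envelope extrapolation of the OSWorld trend in log-odds space
# (raw percentages can't grow linearly past 100%). Five points, so treat
# this as an illustration of the claim, not a forecast.
import math

points = [(0, 14.9), (4, 28.0), (8, 42.2), (12, 61.4), (16, 72.5)]  # months since Oct 2024

xs = [m for m, _ in points]
ys = [math.log(s / (100 - s)) for _, s in points]  # logit of each score

# Ordinary least-squares fit by hand (no numpy needed for five points).
xbar = sum(xs) / len(xs)
ybar = sum(ys) / len(ys)
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
intercept = ybar - slope * xbar

for months, label in [(22, "Aug 2026"), (26, "Dec 2026")]:
    score = 100 / (1 + math.exp(-(intercept + slope * months)))
    print(f"{label}: ~{score:.0f}% if the log-odds trend holds")  # ~89% and ~94%
```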
And the ARC-AGI-2 jump, 4.3x in one generation, suggests Anthropic found something in the reasoning architecture that unlocked a step change in abstract problem-solving. That's the kind of improvement that makes the next generation very hard to predict.
Sonnet 4.6 is available now on claude.ai and through the Anthropic API.