Google Ships Gemini 3.1 Pro
🎮 See it in action: We prompted Gemini 3.1 Pro to build interactive demos on launch day, with no human-written code. View the demos →
Google just released Gemini 3.1 Pro — and the benchmark numbers are significant. This isn't a Flash-tier incremental bump. The core reasoning model that powers the Gemini ecosystem just got substantially smarter.
Starting today, 3.1 Pro is rolling out across:
- Developers: Gemini API in AI Studio, Gemini CLI, Google Antigravity, Android Studio
- Enterprise: Vertex AI and Gemini Enterprise
- Consumers: Gemini app and NotebookLM
It's in preview, not GA — Google wants to validate the updates and test agentic workflows before full release. But you can use it right now.
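If you want to try it from the API rather than the app, a minimal sketch with the google-genai Python SDK looks like the following. The preview model ID used here is an assumption, so check AI Studio for the exact string.

```python
# Minimal sketch: calling the 3.1 Pro preview through the Gemini API.
# The model ID is an assumption -- confirm the exact preview string in AI Studio.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",  # assumed preview ID
    contents="Explain what makes ARC-AGI-2 hard in two sentences.",
)
print(response.text)
```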
The Numbers
The headline: 77.1% on ARC-AGI-2, verified by the benchmark authors. For context, Gemini 3 Pro scored 31.1% on the same benchmark three months ago. That's a 2.5× improvement in abstract reasoning.
Here's how 3.1 Pro stacks up (all scores use Thinking High mode where applicable):
| Benchmark | Gemini 3.1 Pro | Gemini 3 Pro | Best Competitor |
|---|---|---|---|
| ARC-AGI-2 (abstract reasoning) | 77.1% | 31.1% | Opus 4.6: 68.8% |
| GPQA Diamond (PhD-level science) | 94.3% | 91.9% | GPT-5.2: 92.4% |
| HLE (Humanity's Last Exam) | 44.4% | 37.5% | Opus 4.6: 40.0% |
| SWE-bench Verified (agentic coding) | 80.6% | 76.2% | Opus 4.6: 80.8% |
| Terminal-Bench 2.0 (terminal coding) | 68.5% | 56.9% | Opus 4.6: 65.4% |
| BrowseComp (agentic search + code) | 85.9% | 59.2% | — |
| APEX-Agents (long-horizon professional) | 33.5% | 18.4% | — |
| LiveCodeBench Pro (competitive coding) | 2887 Elo | 2439 Elo | — |
| MMMLU (multilingual Q&A) | 92.6% | 91.8% | GPT-5.2: 91.0% |
| MMMU-Pro (multimodal understanding) | 80.5% | 81.0% | — |
| MRCR v2 (long-context 128k) | 84.9% | 77.0% | Tied with Sonnet 4.6 |
| SciCode (scientific coding) | 59% | 56% | — |
Three results jump off the page:
ARC-AGI-2 at 77.1% is now the highest score from any general-purpose frontier model; only the specialized Gemini 3 Deep Think mode (84.6%) scores higher. For abstract reasoning, solving entirely new logic patterns that resist memorization, Google just leapfrogged Claude Opus 4.6 (68.8%) by more than 8 points.
GPQA Diamond at 94.3% is the new all-time high on PhD-level science questions. GPT-5.2 held the record at 92.4%. Google just took it.
The agentic scores are the real story. BrowseComp jumped from 59.2% to 85.9% (+45% relative). APEX-Agents went from 18.4% to 33.5% (+82% relative). Terminal-Bench from 56.9% to 68.5%. These aren't incremental — they suggest fundamental improvements in how 3.1 Pro handles multi-step, long-horizon autonomous tasks.
SWE-bench Verified at 80.6% is the one benchmark in the table where a competitor still leads: Claude Opus 4.6 edges it out at 80.8%. But given the size of the gains everywhere else, that reads as the exception that proves the rule.
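Those relative figures fall straight out of the table; here is a quick arithmetic check using the numbers above.

```python
# Sanity-check the relative gains quoted above, using the table's numbers.
scores = {
    "ARC-AGI-2":          (31.1, 77.1),
    "BrowseComp":         (59.2, 85.9),
    "APEX-Agents":        (18.4, 33.5),
    "Terminal-Bench 2.0": (56.9, 68.5),
}

for name, (gemini_3, gemini_31) in scores.items():
    relative_gain = (gemini_31 - gemini_3) / gemini_3 * 100
    print(f"{name:18s} {gemini_3:5.1f}% -> {gemini_31:5.1f}%  (+{relative_gain:.0f}% relative)")

# Prints roughly: ARC-AGI-2 +148%, BrowseComp +45%, APEX-Agents +82%, Terminal-Bench 2.0 +20%
```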
What Changed
Last week, Google released a major update to Gemini 3 Deep Think that hit 84.6% on ARC-AGI-2. Today's 3.1 Pro release is described as "the upgraded core intelligence that makes those breakthroughs possible." Translation: the reasoning improvements developed for Deep Think have been distilled back into the standard Pro model.
This is the same playbook OpenAI used with o1 → GPT-5: develop reasoning capabilities in a specialized thinking model, then fold those gains back into the general-purpose model. Google is executing it faster — Deep Think update to 3.1 Pro in one week.
The architecture is likely a sparse MoE transformer with native multimodal training (consistent with the Gemini 3 series). Context window remains 1M input tokens / 64K output tokens. Knowledge cutoff is January 2025.
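In practice the request shape is unchanged from Gemini 3 Pro. A hedged sketch of a long-context call that pins the output cap to the documented 64K limit, assuming the same preview model ID and the same GenerateContentConfig fields as the rest of the Gemini 3 line:

```python
# Sketch: a long-context request with the output cap set to the documented 64K limit.
# Assumes 3.1 Pro accepts the same GenerateContentConfig fields as the Gemini 3 line
# and that the preview model ID below is correct.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("repo_dump.txt") as f:          # anything up to the ~1M-token input window
    big_context = f.read()

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",       # assumed preview ID
    contents=[big_context, "Map the module dependencies described in this dump."],
    config=types.GenerateContentConfig(
        max_output_tokens=65536,          # 64K output ceiling noted above
        temperature=0.2,
    ),
)
print(response.text)
```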
Google Antigravity
Buried in the announcement: 3.1 Pro is available on Google Antigravity, which Google describes as their "agentic development platform." This is Google's answer to Anthropic's computer use and OpenAI's Codex — a platform where AI models can autonomously execute multi-step workflows.
The agentic benchmark jumps (BrowseComp, APEX-Agents, Terminal-Bench) aren't coincidental — Google is positioning 3.1 Pro as an agentic-first model. The fact that they're shipping it on Antigravity alongside the model launch tells you where the next phase of competition is headed.
Pricing and Access
Consumer tiers on the Gemini app:
- Free: Limited 3.1 Pro access
- AI Plus ($7.99/mo, intro $3.99 for 2 months): Enhanced access
- AI Pro ($19.99/mo, 1 month free): Higher limits + Nano Banana Pro images + Veo 3.1 Fast video
- AI Ultra ($249.99/mo, intro $124.99 for 3 months): Highest limits, Deep Think mode, Gemini Agent, Project Mariner, YouTube Premium
College students get the AI Pro plan free for one year.
Developer API pricing (preview, mirrors Gemini 3 Pro):
- Input: ~$2–$3.60/M tokens (cached: ~$0.20–$0.36/M)
- Output: ~$6–$9/M tokens
- Context caching discounts available
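As a rough feel for what that means per request, here is a back-of-the-envelope cost model using the low end of the ranges above. These are the preview figures quoted in this post, not confirmed GA pricing.

```python
# Back-of-the-envelope cost estimate using the preview rates quoted above (USD per 1M tokens).
# Low end of each range; treat as ballpark only, not confirmed GA pricing.
RATE_INPUT  = 2.00   # fresh input (~$2-$3.60 range)
RATE_CACHED = 0.20   # cached input (~$0.20-$0.36 range)
RATE_OUTPUT = 6.00   # output (~$6-$9 range)

def estimate_cost(fresh_in: int, cached_in: int, out: int) -> float:
    """Approximate cost in USD for one request at the quoted preview rates."""
    return (fresh_in * RATE_INPUT + cached_in * RATE_CACHED + out * RATE_OUTPUT) / 1_000_000

# Example: 200K fresh tokens, 800K cached context, 8K output.
print(f"${estimate_cost(200_000, 800_000, 8_000):.2f}")  # about $0.61
```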
Early Customer Reactions
JetBrains (IntelliJ, PyCharm) reported up to 15% improvement over the best Gemini 3 Pro Preview runs, calling it "stronger, faster, and more efficient, requiring fewer output tokens while delivering more reliable results."
Databricks noted 3.1 Pro achieved best-in-class results on OfficeQA, their benchmark for grounded reasoning on tabular and unstructured data.
Cartwheel (3D animation) highlighted "substantially improved understanding of 3D transformations" — a historically weak area for language models.
The Competitive Picture
With 3.1 Pro, the frontier landscape as of February 2026:
Reasoning (ARC-AGI-2):
- Gemini 3 Deep Think — 84.6%
- Gemini 3.1 Pro — 77.1% ← NEW
- Claude Opus 4.6 — 68.8%
- Claude Sonnet 4.6 — 58.3%
- GPT-5.2 — 52.9%
Science (GPQA Diamond):
- Gemini 3.1 Pro — 94.3% ← NEW
- GPT-5.2 — 92.4%
- Gemini 3 Pro — 91.9%
Hardest Benchmark (HLE):
- Gemini 3 Pro (with search + code execution) — 45.8%
- Gemini 3.1 Pro — 44.4% ← NEW
- Claude Opus 4.6 — 40.0%
- GPT-5 — 35.2%
Coding (SWE-bench Verified):
- Claude Sonnet 4.5 — 82.0% (with parallel test-time compute)
- Claude Opus 4.5 — 80.9%
- Claude Opus 4.6 — 80.8%
- Gemini 3.1 Pro — 80.6% ← NEW
- GPT-5.2 — 80.0%
Google now holds the crown on reasoning and science. Anthropic still leads on coding, but Opus 4.6's edge is just two-tenths of a percentage point.
What This Means
1. Google's reasoning leap is real. 31.1% → 77.1% on ARC-AGI-2 in three months. That's not incremental improvement — that's a regime change. Whatever they figured out for Deep Think, it transfers.
2. The agentic benchmark era has begun. Several of the benchmarks in this release (BrowseComp, APEX-Agents, Terminal-Bench, τ2-bench) didn't exist a year ago. The labs are now competing on whether their models can do things autonomously, not just answer questions. Google is leading that framing.
3. Preview is the new launch. Google, Anthropic, and OpenAI are all shipping models in "preview" or "early access" now. The line between preview and GA is marketing, not capability. If you're waiting for GA to try 3.1 Pro, you're already behind.
4. Pricing compression continues. At ~$2–$3.60/M input tokens, 3.1 Pro is in the same ballpark as GPT-5.2 and Claude, but cached input pricing (~$0.20/M) makes it dramatically cheaper for production workloads with repeated context (see the caching sketch below).
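A minimal version of that caching setup with the google-genai Python SDK; the model ID is an assumption, and this presumes 3.1 Pro supports explicit caching the same way earlier Gemini models do.

```python
# Sketch: explicit context caching so the reused prompt bills at the cached-input rate.
# Assumes the preview model ID below and that 3.1 Pro supports explicit caching
# the same way earlier Gemini models do.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")
MODEL = "gemini-3.1-pro-preview"  # assumed preview ID

# Cache the large, reused part of the prompt once (for example, a product manual).
cache = client.caches.create(
    model=MODEL,
    config=types.CreateCachedContentConfig(
        system_instruction="Answer questions about the attached manual.",
        contents=[open("manual.txt").read()],
        ttl="3600s",  # keep the cache alive for one hour
    ),
)

# Later requests reference the cache and pay full input price only for the new tokens.
response = client.models.generate_content(
    model=MODEL,
    contents="Which firmware versions does section 4 cover?",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```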
Gemini 3.1 Pro is available now in preview. Try it in Google AI Studio.