AI Just Got Scary Good at Breaking Digital Vaults
Imagine billions of dollars locked inside robot-operated vaults with no guards, no manager, no phone number to call — just code that follows rules automatically. That's what smart contracts are: programs that hold and move cryptocurrency based on pre-written instructions, with nobody in charge.
OpenAI just teamed up with Paradigm, one of crypto's most influential investment firms, to build EVMbench — a standardized test for AI hackers. They took 120 real security flaws drawn from 40 professional security audits and asked frontier AI models to find them, fix them, and exploit them.
The result should make anyone holding crypto sit up straight: GPT-5.3-Codex can now crack 72.2% of critical vulnerabilities. When they started building this benchmark, the best models could barely manage 20%.
That's not gradual improvement. That's going from "D student" to "honor roll" in months.
What EVMbench Actually Tests
The benchmark evaluates AI agents across three modes, each progressively harder (a rough sketch of the grading follows the list):
🔍 Detect — The AI audits a smart contract codebase and tries to find every security flaw. Think of it as reading every line of a bank vault's blueprint looking for weak spots. Scored on how many known bugs it catches.
🔧 Patch — The AI must fix vulnerabilities without breaking the contract's intended functionality. This is surgery, not demolition — you need to remove the tumor without killing the patient.
💥 Exploit — The AI attempts to drain funds from vulnerable contracts deployed on a sandboxed blockchain. Full end-to-end attacks, graded by whether money actually moves.
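To make the three modes concrete, here's a minimal sketch of grading logic consistent with those descriptions. It is not EVMbench's actual code (the real harness is written in Rust and its grading internals aren't public in this piece), and every name in it (`RunResult`, `grade_patch`, and so on) is hypothetical.

```python
# Illustrative sketch of EVMbench-style grading. All names here are
# hypothetical; the real harness is Rust-based and not reproduced here.
from dataclasses import dataclass

@dataclass(frozen=True)
class RunResult:
    reported_bugs: set[str]       # bug IDs the agent claims to have found
    exploit_drained_funds: bool   # did the end-to-end attack move money?
    functional_tests_pass: bool   # does the contract still work post-patch?
    exploit_still_works: bool     # does the original exploit succeed post-patch?

def grade_detect(known_bugs: set[str], result: RunResult) -> float:
    """Detect mode: recall against the audit's known findings."""
    if not known_bugs:
        return 0.0
    return len(result.reported_bugs & known_bugs) / len(known_bugs)

def grade_patch(result: RunResult) -> bool:
    """Patch mode: the exploit must fail AND intended behavior must survive."""
    return (not result.exploit_still_works) and result.functional_tests_pass

def grade_exploit(result: RunResult) -> bool:
    """Exploit mode: pass/fail on whether funds actually moved."""
    return result.exploit_drained_funds
```

Note that Patch is the only mode graded on two conditions at once: the exploit must die and the functionality must survive. That conjunction is exactly where current models reportedly stumble.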
The testing infrastructure is serious: a Rust-based harness deploys contracts onto a local Anvil blockchain instance, replays agent transactions deterministically, and restricts unsafe methods. No live networks touched. All vulnerabilities are historical and publicly documented.
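For a rough feel of what that sandbox looks like in practice, here's a minimal Python sketch using web3.py against a locally running Anvil node. It stands in for the Rust harness described above, and the `Vault.json` artifact is a hypothetical compiled contract; everything runs on a throwaway local chain, so no live network is touched.

```python
# Minimal sketch: deploy a contract to a local Anvil instance with web3.py.
# Assumes `anvil` (from Foundry) is already running on its default port.
import json
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))  # Anvil's default RPC
assert w3.is_connected(), "start a local node first: `anvil`"

deployer = w3.eth.accounts[0]  # Anvil pre-funds ten test accounts

with open("Vault.json") as f:  # hypothetical compiled artifact (abi + bytecode)
    artifact = json.load(f)

contract = w3.eth.contract(abi=artifact["abi"], bytecode=artifact["bytecode"])
tx_hash = contract.constructor().transact({"from": deployer})
receipt = w3.eth.wait_for_transaction_receipt(tx_hash)
print("deployed at", receipt.contractAddress)

# A real harness would go further: replay transactions deterministically and
# block cheat RPCs like anvil_setBalance that would trivialize the exercise.
```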
Most of the test cases come from Code4rena, the platform where human security researchers compete to find bugs in crypto protocols for bounties. The benchmark also includes scenarios from Stripe's Tempo blockchain — their purpose-built L1 for stablecoin payments — grounding the tests in the kind of payment infrastructure that's about to go mainstream.
The Numbers That Matter
| Model | Exploit Score | Released |
|---|---|---|
| Top models (early development) | <20% | — |
| GPT-5 | 31.9% | ~Aug 2025 |
| GPT-5.3-Codex | 72.2% | Feb 2026 |
That's more than double GPT-5's score in roughly six months, and over triple where the best models stood when the benchmark was first built.
But here's what the press release buries: AI is dramatically better at attacking than defending. Detect and Patch modes score significantly lower. The AI tends to stop after finding one bug rather than exhaustively auditing the codebase, and when patching, it frequently breaks the contract's intended functionality while trying to fix the vulnerability.
This is the sword-vs-shield problem. Breaking things requires finding just one crack. Defending requires covering every crack. AI is amplifying this asymmetry, not resolving it.
Why This Matters Beyond Crypto
"I don't own crypto — why should I care?"
Three reasons:
Stripe is involved. It processes payments for Amazon, Shopify, DoorDash — basically half the internet's checkout buttons. And it's building blockchain payment rails with Tempo. If AI can hack those contracts, it touches regular commerce. The inclusion of Tempo scenarios isn't academic curiosity — it's pre-launch stress-testing, which signals Stripe is closer to deploying than most realize.
Contagion is real. When $100B+ in assets is at risk, collapses ripple outward into venture capital, pensions with crypto exposure, and tech companies dependent on that funding ecosystem.
Smart contracts aren't just DeFi anymore. They're being adopted for insurance, real estate, supply chains, and enterprise payments. The same AI that hacks a DeFi vault today could hack your automated insurance claim tomorrow.
Uncommon Insights
The following insights are informed, educated guesses drawn from multiple AI analyses — not established facts. They represent the kind of non-obvious thinking that experienced observers would apply to this story.
- The benchmark may be easier than it looks. These are already-discovered bugs from public audit reports. The AI isn't finding novel vulnerability classes — it's pattern-matching against known exploit archetypes (a toy illustration follows this list). This is closer to "can you solve a practice exam" than "can you pass the real test." The jump to novel zero-days in live contracts is a much harder problem. But even 15-20% success on unknown vulnerabilities across $100B in contracts would be catastrophic.
- Automated exploit weaponization is now viable. Someone will build a system that monitors new contract deployments → runs exploit detection → auto-constructs drain transactions → executes via private transaction relays. The time window between deployment and exploit shrinks from days to minutes. This is the smart contract equivalent of automated CVE exploitation, but with immediate financial payoff and pseudonymous execution.
- The audit industry faces an extinction curve. Smart contract auditing is a $500M+/year industry. If AI Detect mode hits 85%+ within 18 months, the economic floor drops out of manual auditing. The best human auditors become AI-augmented supervisors; the median auditor becomes obsolete. Ironically, Code4rena's competitive model — paying humans per-bug bounties — is generating the training data for its own disruption.
- OpenAI is running the AWS playbook. They're not launching an audit firm. They're selling picks and shovels — letting others build security products on their API and taking margin on every call. At even $100 per API-powered audit, the 1.7M contracts deployed weekly on Ethereum alone represent roughly $170M a week in theoretical audit volume.
- The $10M defense commitment deserves scrutiny. It costs OpenAI roughly $2-3M in actual compute. It creates dependency on OpenAI's API for security tooling. It generates high-quality training data from security researchers using the API. And it provides PR cover for shipping offense-first capabilities. This is a platform play disguised as philanthropy.
- This is regulatory ammunition. Regulators now have a concrete, quotable number: "AI can exploit 72% of known smart contract vulnerabilities in systems holding $100B+ in consumer funds." Expect this benchmark cited in SEC, CFTC, and EU MiCA enforcement actions within 12 months.
- Paradigm's involvement tells you something. Paradigm doesn't do PR partnerships — they do strategic bets. Co-authoring a security benchmark means they believe AI×crypto security is a mega-theme, not a niche. Their portfolio companies get early access to defensive tooling. But it also means a VC firm with financial positions in protocols helped design the benchmark that evaluates those protocols' security. The conflict of interest is structural.
- EVM itself becomes a liability. If AI is this good at exploiting EVM contracts, the argument for formally verified smart contract languages — Move (Aptos/Sui), Cairo (Starknet), Rust-based VMs — strengthens enormously. Ethereum compatibility may become a security risk, not just a technical choice.
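On the pattern-matching point from the first bullet: known archetypes like reentrancy (an external call made before the balance is updated) are so well documented that even a crude textual scan can flag textbook cases. Below is a deliberately naive sketch, assuming Solidity source as plain text and a `balances` mapping by that exact name; real tools analyze the AST or bytecode, and frontier models do far more. But it shows how different matching a known archetype is from discovering a novel one.

```python
# Deliberately naive reentrancy-archetype scanner: flags an external
# call that appears before a later write to a balance-like mapping.
# Real auditors analyze the AST/bytecode; this is pattern-matching only.
import re

EXTERNAL_CALL = re.compile(r"\.call\{value:")             # Solidity >=0.7 call syntax
STATE_WRITE = re.compile(r"balances\[[^\]]+\]\s*[-+]?=")  # assumes a `balances` mapping

def flag_reentrancy(solidity_source: str) -> bool:
    """Return True if an external call precedes a balance update."""
    call_line = write_line = None
    for i, line in enumerate(solidity_source.splitlines()):
        if call_line is None and EXTERNAL_CALL.search(line):
            call_line = i
        if STATE_WRITE.search(line):
            write_line = i
    return call_line is not None and write_line is not None and call_line < write_line

vulnerable = """
function withdraw() external {
    (bool ok, ) = msg.sender.call{value: balances[msg.sender]}("");
    require(ok);
    balances[msg.sender] = 0;   // state update AFTER the call: classic bug
}
"""
print(flag_reentrancy(vulnerable))  # True
```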
What Happens Next
Near-term (0-6 months):
- AI-native audit startups raise at inflated valuations
- Smart contract insurance protocols become actuarially viable for the first time
- Other AI labs (Anthropic, Google) rush to publish competing security benchmarks
- The exploit improvement curve continues — 85%+ on known patterns is likely by summer
Medium-term (6-18 months):
- Mid-tier manual audit firms face severe margin compression or pivot to AI-augmented models
- Regulators begin citing EVMbench scores in enforcement guidance
- On-chain AI agents managing wallets create entirely new attack surfaces
- Stripe competitors need equivalent security narratives for their own payment infrastructure
The big question nobody's answering: What's the offensive/defensive capability ratio over time? If Exploit mode improves 2x faster than Detect mode, we have a growing security gap. OpenAI should publish this curve explicitly. So far, they conspicuously haven't.
Further Reading
- Introducing EVMbench — OpenAI — The official announcement with methodology details and full benchmark results
- EVMbench: An Open Benchmark for Smart Contract Security Agents — Paradigm — Paradigm's perspective on why they built this and where they see it going
- EVMbench Technical Paper (PDF) — OpenAI/Paradigm — The full academic paper with detailed methodology, grading criteria, and model comparisons
- Can AI Agents Boost Ethereum Security? — Decrypt — Good overview connecting EVMbench to Stripe's Tempo and the weekly smart contract deployment numbers
- OpenAI Unveils EVMbench to Test AI on Smart Contract Security — CoinDesk — The crypto industry's reaction and broader context on AI-crypto convergence
- OpenAI and Paradigm Partner on Smart Contract Security — The Block — Industry analysis of the partnership dynamics and competitive implications
- Strengthening Cyber Resilience — OpenAI — OpenAI's broader cybersecurity strategy and safeguards framework that EVMbench feeds into
- Introducing Aardvark — OpenAI — The security research agent that OpenAI is expanding alongside EVMbench