OpenAI’s LLM just earned a gold medal at the 2025 International Mathematical Olympiad (IMO)
An experimental OpenAI reasoning model has earned a gold‑medal score at the 2025 International Mathematical Olympiad—a milestone that many AI researchers once thought was a decade away. Run under the same 4½‑hour, no‑internet exam rules as human contestants, the model solved five of the six problems and scored 35 out of 42 points, comfortably above this year’s gold cut‑off.
Three former IMO medalists independently graded each proof and reached unanimous agreement on the score, confirming that the machine’s work meets the same bar of rigor demanded of the world’s top young mathematicians. (Simon Willison’s Weblog)
What is the IMO and why does it matter?
The International Mathematical Olympiad is the oldest and most prestigious annual competition for pre‑university students. Over two consecutive days, contestants tackle three fresh problems each day; every problem is worth seven points, so a perfect score is 42.
In recent years more than 100 countries have sent six‑person teams, yet only about the top 8 percent of individual contestants walk away with gold. (Wikipedia, imo-official.org)
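The scoring rules above are simple enough to sketch in a few lines. This is an illustrative toy, not anything from the competition’s own software; the medal cut‑offs shown are example values (29 points was the 2024 gold threshold), since real cut‑offs are set each year from the score distribution.

```python
# Toy sketch of IMO scoring: six problems, each graded 0-7,
# so a perfect score is 6 * 7 = 42.
PROBLEMS = 6
MAX_PER_PROBLEM = 7

# Example cutoffs only (2024's actual values); these change every year.
EXAMPLE_CUTOFFS = {"gold": 29, "silver": 22, "bronze": 16}

def total_score(per_problem):
    """Sum per-problem marks after validating the 0-7 range."""
    assert len(per_problem) == PROBLEMS
    assert all(0 <= s <= MAX_PER_PROBLEM for s in per_problem)
    return sum(per_problem)

def medal(score, cutoffs=EXAMPLE_CUTOFFS):
    """Return the highest medal whose cutoff the score clears, else None."""
    for name in ("gold", "silver", "bronze"):
        if score >= cutoffs[name]:
            return name
    return None

# Full marks on five problems and zero on the sixth:
print(total_score([7, 7, 7, 7, 7, 0]))  # 35
print(medal(35))                        # gold
```

Under these example cut‑offs, a 35 clears gold with room to spare, which is exactly the situation described in the opening paragraph.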
A proving ground for AI reasoning
Unlike benchmarks that accept a short numeric answer, the IMO demands page‑long proofs that experts must read line by line. That makes it a stress test for two capabilities that large language models still struggle with: deep logical planning and faithful, verifiable explanations.
Passing the IMO therefore suggests an LLM can move beyond pattern‑matching into sustained, checkable reasoning—an ability crucial for scientific discovery, formal verification, and safety‑critical decision‑making. (Simon Willison’s Weblog)
Inside OpenAI’s gold‑medal run
OpenAI’s team did not create a bespoke “IMO solver.” Instead they scaled up a general‑purpose LLM, added longer “thinking” time at inference, and introduced new reinforcement‑learning techniques that reward solutions only after independent graders pronounce them correct.
The result: full‑credit proofs on Problems 1–5—worth the full 35 points—while the notoriously difficult Problem 6 went unsolved.
Their 35/42 score exceeds the previous year’s gold threshold (29 points in 2024), showing that the program would have stood on the podium alongside human champions. (Simon Willison’s Weblog, imo-official.org)
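The reward scheme described above—credit only when independent graders all pronounce a solution correct—can be sketched as a tiny verifier‑gated reward function. This is a hypothetical illustration, not OpenAI’s actual training code, and the stand‑in “graders” here are trivial checks on a toy proof string.

```python
# Toy sketch of verifier-gated reward: a candidate solution earns
# reward 1.0 only if every independent grader approves it, else 0.0.
# (Not OpenAI's real pipeline; graders are illustrative stand-ins.)

def grade(solution, graders):
    """All graders must approve for the solution to earn any reward."""
    return 1.0 if all(g(solution) for g in graders) else 0.0

# Hypothetical stand-in graders checking a toy "proof" string.
graders = [
    lambda s: "QED" in s,   # proof reaches a stated conclusion
    lambda s: len(s) > 10,  # proof is not trivially short
]

print(grade("By induction on n ... QED", graders))  # 1.0
print(grade("trivial", graders))                    # 0.0
```

The design point is the unanimity requirement: a single dissenting grader zeroes the reward, which pushes the model toward solutions that survive independent scrutiny rather than ones that merely look plausible.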
How did Google DeepMind do in 2024?
Just one year earlier, Google DeepMind’s AlphaProof and AlphaGeometry 2 systems combined language‑model reasoning with the self‑play techniques that once mastered Go and chess.
Working together they produced valid proofs for four of the six 2024 IMO problems—28 of 42 points, a silver‑medal performance just one point shy of that year’s 29‑point gold cut‑off.
The researchers highlighted their “neuro‑symbolic” recipe: the Gemini LLM translated each question into the Lean proof language, after which a search‑and‑verify engine stitched together formal proofs the Lean checker could certify. (WIRED, LessWrong)
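For readers unfamiliar with Lean, a machine‑certified proof of the kind that pipeline produces looks like the following. This is a deliberately trivial toy theorem, unrelated to any IMO problem, using the standard `Nat.add_comm` lemma from Lean 4’s core library:

```lean
-- Toy Lean 4 theorem: commutativity of addition on natural numbers.
-- The Lean kernel checks this proof mechanically—the same kind of
-- certification a search-and-verify engine relies on at each step.
theorem add_comm_toy (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The appeal of this setup is that correctness does not depend on trusting the language model: any proof the Lean kernel accepts is valid by construction, so the LLM only has to *find* candidate proofs, not vouch for them.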
Why the leap from silver to gold is significant
Moving from solving 4 problems to 5 may sound incremental, but the jump from silver to gold at the IMO is steep: each additional point typically vaults a contestant past dozens of peers.
For AI, clearing that bar signals that long‑horizon reasoning is starting to scale like other LLM capabilities.
It also validates the research strategy of letting one flexible model tackle many domains, rather than building isolated expert systems.
What’s next?
OpenAI says the gold‑medal model is a research prototype that will remain in the lab for now, but the broader lesson is public: reinforcement‑trained, long‑context LLMs can already match the best teenage mathematicians on fresh, open‑ended proofs. If that rate of progress continues, the next frontier may be undergraduate math contests—or even unsolved research problems.
For human mathematicians the machines look less like replacements and more like tireless collaborators, generating candidate arguments that experts can refine.
Either way, 2025 will be remembered as the year competitive mathematics joined chess and Go as a discipline where AI reached the very top tier—and did so in language we can all read. (Simon Willison’s Weblog, WIRED)