Rising · 1 source · last seen 8h ago · first seen 8h ago
How is Gemini 3.1 at the top of SWE-bench?
Genuinely confused. In my personal experience, it's nowhere near as reliable or capable as Claude Opus 4.6 or GPT 5.4 for real-world coding tasks. Those models feel far more consistent, especially with complex debugging and reasoning. Are these benchmarks not reflecting actual developer workflows?
Lead: r/singularity · Bigness: 27 · Tags: google, top, swe-bench
📡 Coverage: 10 · 1 news source
🟠 Hacker News: 0
🔴 Reddit: 64 · 109 upvotes across 1 sub
📈 Google Trends: 0
Receipts (all sources)
How is Gemini 3.1 at the top of SWE-bench?
REDDIT · r/singularity · 8h ago · ⬆ 109 · 💬 51
Score: 120