Rising · 1 source · last seen 8h ago · first seen 8h ago
How is Gemini 3.1 at the top of SWE-bench?
Genuinely confused. In my personal experience, it's nowhere near as reliable or capable as Claude Opus 4.6 or GPT 5.4 for real-world coding tasks. Those models feel far more consistent, especially with complex debugging and reasoning. Are these benchmarks not reflecting actual developer workflows?
Lead: r/singularity · Bigness: 27 · Tags: google, top, swe-bench
📡 Coverage: 10 · 1 news source
🟠 Hacker News: 0
🔴 Reddit: 64 · 109 upvotes across 1 sub
📈 Google Trends: 0
Receipts (all sources)
How is Gemini 3.1 at the top of SWE-bench?
REDDIT · r/singularity · 8h ago · ⬆ 109 · 💬 51
Score: 120