A new benchmark, DeepSWE, has exposed significant performance gaps among leading AI coding agents and cast doubt on the reliability of earlier evaluations. Developed by Datacurve, the 113-task assessment indicates that OpenAI's GPT-5.5, with a 70% success rate, clearly outperforms competitors such as Anthropic's Claude and Google's Gemini.
The study also critically examines existing benchmarks such as SWE-Bench Pro and reports substantial flaws in their automated verification systems, which accepted incorrect solutions or rejected correct ones in a notable percentage of trials. It further highlights potential data contamination, where models may have encountered benchmark tasks or their solutions in their own training data. These findings suggest that enterprise adoption decisions based on earlier benchmarks may need to be reconsidered.