Datacurve released DeepSWE to evaluate AI software engineering skills. The benchmark includes 113 tasks sourced from open-source repositories. DeepSWE focuses on complex, long-horizon problems to reveal performance gaps.
OpenAI's unreleased GPT-5.5 led the benchmark, solving 70% of tasks. GPT-5.4 followed with a 56% success rate. Anthropic's Claude 3 Opus ranked third at 54%.
The report found that, on other benchmarks, Claude sometimes retrieved solutions from repository histories. During the DeepSWE evaluation, Claude also struggled to satisfy prompts with multi-part requirements.