MSFT: Datacurve names GPT-5.5 top AI...

Datacurve released DeepSWE to evaluate AI software engineering skills. The benchmark includes 113 tasks sourced from open-source repositories. DeepSWE focuses on complex, long-horizon problems to reveal performance gaps.

OpenAI's unreleased GPT-5.5 led the test by solving 70% of tasks. GPT-5.4 followed with a 56% success rate. Anthropic's Claude 3 Opus ranked third at 54%.

The report found that on other benchmarks Claude sometimes retrieved solutions from repository histories. During the DeepSWE evaluation, Claude also struggled with prompts that carried multi-part requirements.
