AI coding tools are advancing fast—but they’re still not ready to replace human developers when it comes to debugging software.
Despite bold claims from tech leaders about AI’s growing role in software development, a new study from Microsoft Research shows just how far these tools still have to go. According to the report, even top-tier models from OpenAI and Anthropic are struggling to fix common bugs that wouldn’t trouble experienced programmers.
Microsoft researchers tested nine leading AI models using a benchmark known as SWE-bench Lite. This benchmark includes 300 real-world debugging tasks. The AI models were embedded in a single-agent system that had access to standard debugging tools, including a Python debugger. The goal? See how many tasks the AI could fix on its own.
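The study does not publish its harness code, but the setup it describes (a single agent that observes a failing task, may invoke tools such as a debugger, and proposes a fix) can be illustrated with a toy sketch. Everything below is hypothetical: the buggy source, the `stub_model` "agent," and the loop are illustrations, not the researchers' implementation, and the stub stands in for an LLM that could also issue pdb commands.

```python
# Toy sketch of a single-agent debugging loop (hypothetical, not the
# study's harness). A real agent would call an LLM here and could run
# debugger commands (e.g., via pdb) before proposing a patch.

BUGGY_SOURCE = "def add(a, b):\n    return a - b  # bug: should be a + b\n"

def run_test(source):
    """Execute the candidate source and run a simple check against it."""
    namespace = {}
    exec(source, namespace)
    if namespace["add"](2, 3) == 5:
        return True, None
    return False, "add(2, 3) != 5"

def stub_model(source, failure_message):
    """Stand-in for the LLM: given the source and a failure, emit a patch."""
    return source.replace("a - b", "a + b")

def debug_loop(source, max_steps=3):
    """Iterate test -> (agent proposes fix) -> retest, up to a step budget."""
    for _ in range(max_steps):
        solved, failure = run_test(source)
        if solved:
            return source, True
        source = stub_model(source, failure)
    return source, run_test(source)[0]
```

In this framing, a benchmark like SWE-bench Lite scores how often the loop ends with the test passing; the study's point is that real models frequently do not get there.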
The results weren’t promising. Anthropic’s Claude 3.7 Sonnet led the pack with a 48.4% success rate. OpenAI’s o1 and o3-mini followed with 30.2% and 22.1%, respectively. Even with powerful models and built-in tools, not one of them solved even half of the issues.
So, what’s holding these tools back?
According to the study, one key reason is that the models struggle to use the available debugging tools effectively. More critically, there’s a shortage of training data showing real human debugging processes. These “sequential decision-making traces” are vital for teaching AI how to think like a developer—spotting issues, testing hypotheses, and zeroing in on solutions.
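To make the idea of a "sequential decision-making trace" concrete, here is a hypothetical example of what one recorded debugging session might look like as data. The field names and step contents are invented for illustration; the study does not specify a trace format.

```python
import json

# Hypothetical trace of a developer debugging interactively: each step
# records the action taken (e.g., a pdb command) and what was observed.
# This is the kind of sequential data the researchers say is scarce.
trace = {
    "task": "fix failing test in parser module",
    "steps": [
        {"action": "run_tests", "observation": "test_parse_empty fails"},
        {"action": "pdb: b parser.py:42", "observation": "breakpoint set"},
        {"action": "pdb: p tokens", "observation": "tokens == []"},
        {"action": "hypothesis", "observation": "empty input not handled"},
        {"action": "edit: add early return for empty input",
         "observation": "all tests pass"},
    ],
}

serialized = json.dumps(trace)  # traces like this could form a training corpus
```

Collections of such step-by-step records, rather than only final diffs, are what the researchers argue would teach models to debug interactively.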
The researchers emphasized the need for new training methods: “We strongly believe that training or fine-tuning [models] can make them better interactive debuggers,” they wrote. But doing so will require collecting specialized data, such as logs of developers interacting with debugging tools and walking through fixes step-by-step.
This isn’t the first time AI’s coding limitations have been exposed. Earlier evaluations have shown that popular code-generating models often introduce security flaws and logical errors. For instance, the AI coding assistant Devin completed just 3 out of 20 tasks in another recent test.
Yet Microsoft’s latest findings offer one of the most detailed snapshots so far of the gap between AI and skilled software engineers. While the buzz around AI tools continues to grow, this study could serve as a cautionary reminder for teams eager to hand over debugging to machines.
Not everyone is buying into the idea that AI will replace developers anytime soon. Microsoft co-founder Bill Gates, Replit CEO Amjad Masad, IBM CEO Arvind Krishna, and Okta CEO Todd McKinnon have all said they believe programming will remain a vital human job for the foreseeable future.
And with results like these, they may be right.