Benchmarking research shows leading AI models still struggle to reliably produce structured outputs used in software development
New research from the University of Waterloo shows that artificial intelligence still struggles with some basic software development tasks, raising questions about how reliably AI systems can assist developers.
As large language models are increasingly incorporated into software development, developers have found it difficult to ensure that AI-generated responses are accurate, consistent, and easy to integrate into larger development workflows.
Previously, LLMs responded to software development prompts with free-form natural language answers, which downstream tools could not reliably parse. To address this problem, several AI companies, including OpenAI, Google, and Anthropic, have introduced “structured outputs”: a feature that constrains LLM responses to predefined formats such as JSON, XML, or Markdown, making them easier for both humans and software systems to read and process.
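The appeal of structured outputs is that a consuming program can parse the model's reply mechanically rather than scraping free text. A minimal sketch of that consumer side, assuming a hypothetical JSON reply whose key names ("function_name", "arguments") are purely illustrative and not any vendor's actual schema:

```python
import json

# A hypothetical raw reply from a model asked to respond in JSON.
# The key names are illustrative, not a real vendor schema.
raw_reply = '{"function_name": "create_user", "arguments": {"name": "Ada", "admin": false}}'

def parse_structured_reply(text: str) -> dict:
    """Parse a reply expected to be a JSON object and verify that
    the keys a downstream tool relies on are actually present."""
    data = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    required = {"function_name", "arguments"}
    missing = required - data.keys()
    if missing:
        raise ValueError(f"reply missing required keys: {missing}")
    return data

parsed = parse_structured_reply(raw_reply)
print(parsed["function_name"])  # -> create_user
```

When the model drifts from the agreed format, the parse or key check fails loudly instead of silently corrupting the workflow, which is exactly the failure mode the benchmark measures.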
A new benchmarking study from Waterloo, however, shows that the technology is not yet as reliable as many developers had hoped. Even the most advanced models achieved only about 75 per cent accuracy in the tests, while open-source models performed closer to 65 per cent.