A team of software engineering researchers has won the ACM SIGSOFT Distinguished Paper Award at FORGE 2026, the 3rd ACM International Conference on AI Foundation Models and Software Engineering, held as part of ICSE 2026, the 48th IEEE/ACM International Conference on Software Engineering.
The award recognizes recent master’s graduate Evelien Riddell, whose thesis forms the foundation of the paper, along with MMath student James Riddell, PhD student Gengyi Sun, research engineer Michał Antkiewicz and Professor Krzysztof Czarnecki. Their paper is titled “Stalled, Biased, and Confused: Uncovering Reasoning Failures in LLMs for Cloud-Based Root Cause Analysis.”
Based on Evelien’s master’s research, the work developed a controlled framework to evaluate how effectively large language models perform root cause analysis, the process of identifying the underlying source of system failures, in complex cloud environments.
Their findings show that while current large language models often generate plausible diagnoses, their underlying reasoning is frequently flawed, with accuracy deteriorating as agentic complexity increases. These results suggest that improving root cause analysis will require more than simply adding agents. Instead, it will depend on designing systems that reason carefully, systematically, and transparently.

Evelien Riddell is a recent master’s graduate from the Cheriton School of Computer Science. She conducted this award-winning research under the supervision of Professor Krzysztof Czarnecki, who leads the Waterloo Intelligent Systems Engineering Lab. In addition to her MMath degree, she holds a Bachelor of Science in physics and computer science from the University of British Columbia.
More about this research
When a modern cloud service fails, engineers must trace a fault through a tangled web of microservices, databases, and hosts to find its origin, a process called root cause analysis. Multi-hop faults are particularly challenging, as visible symptoms may appear several service boundaries away from the true source. With their ability to reason over heterogeneous data and use external tools, large language models (LLMs) are a natural candidate for automating this process.
But how well do they actually reason?
Previous work often embedded LLMs within complex multi-agent pipelines, making it difficult, if not impossible, to determine whether failures stemmed from the model’s reasoning or from the surrounding system design. This project addresses that gap. The research team designed a controlled evaluation framework that removes confounding factors, using simplified agent architectures, deterministic tools and structured knowledge graphs to expose LLM reasoning in isolation. They evaluated six open-source LLMs across three agentic workflows and two real-world microservice datasets, covering 48,000 executed scenarios over 228 days of non-parallelized computation time.
Beyond measuring accuracy, they developed a taxonomy of 16 reasoning failure types and used an LLM-as-a-judge evaluator to annotate over 3,000 inference traces. This work is the first empirical investigation into the isolated reasoning capabilities and characteristic failure modes of LLM-based root cause analysis agents.
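As a rough illustration of how an LLM-as-a-judge annotator can work, the sketch below asks a judge model to map each reasoning trace to one failure label. This is a minimal sketch, not the team’s implementation: the `call_llm` helper is a hypothetical stand-in for any prompt-in, text-out completion API, and the three example labels merely echo the paper’s title, whereas the actual taxonomy has 16 types.

```python
# Minimal LLM-as-a-judge sketch. `call_llm` is a hypothetical
# prompt-in, text-out helper; the labels below are illustrative
# placeholders echoing the paper's title, not the real 16-type taxonomy.

FAILURE_TYPES = ["stalled", "biased", "confused", "no_failure"]

JUDGE_PROMPT = (
    "You are auditing the reasoning of a root cause analysis agent.\n"
    "Reasoning trace:\n{trace}\n\n"
    "Classify the dominant reasoning failure as exactly one of: {labels}.\n"
    "Reply with the label only."
)

def judge_trace(trace: str, call_llm) -> str:
    """Label a single reasoning trace with one failure type."""
    prompt = JUDGE_PROMPT.format(trace=trace, labels=", ".join(FAILURE_TYPES))
    answer = call_llm(prompt).strip().lower()
    # Guard against off-taxonomy answers from the judge model.
    return answer if answer in FAILURE_TYPES else "unparseable"

def annotate(traces, call_llm) -> dict[int, str]:
    """Annotate a batch of traces (the study labelled over 3,000)."""
    return {i: judge_trace(t, call_llm) for i, t in enumerate(traces)}
```

Constraining the judge to a closed label set and validating its output is what makes such annotations comparable across models and workflows.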
The findings show that current open-source LLMs face substantial barriers to reliable cloud root cause analysis. Accuracy remains low across models and workflows, reasoning failures are pervasive, and adding agentic complexity often compounds errors rather than resolving them. These results indicate that progress will require improving specific reasoning capabilities, not simply increasing agentic complexity.
The results point to concrete directions: early hypothesis diversification, self-consistency and evidence-sufficiency checks, explicit domain guidance for triage and causal reasoning, and improved alert representation strategies, particularly for trace data. Understanding how reasoning failures co-occur and compound remains an important open direction, especially in multi-step and multi-agent settings where early missteps cascade.
This work argues that transparency in reasoning quality is as important as accuracy when evaluating root cause analysis agents. The failure taxonomy and LLM-as-a-judge evaluator developed in this research are reusable tools for diagnosing and improving LLM reasoning in complex system diagnosis and, by extension, in other reasoning-intensive agentic tasks.
To learn more about the award-winning research on which this article is based, please see Evelien Riddell, James Riddell, Gengyi Sun, Michał Antkiewicz and Krzysztof Czarnecki. Stalled, Biased, and Confused: Uncovering Reasoning Failures in LLMs for Cloud-Based Root Cause Analysis. Presented at the IEEE/ACM Third International Conference on AI Foundation Models and Software Engineering (FORGE ’26), April 12–13, 2026, Rio de Janeiro, Brazil.