Please note: This master’s thesis presentation will take place in E7 5419 and online.
Evelien Riddell, Master’s candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Krzysztof Czarnecki
Root cause analysis (RCA) is essential for diagnosing failures in complex software systems and ensuring system reliability. The highly distributed and interdependent nature of modern cloud-based systems often complicates RCA, particularly under multi-hop fault propagation, where symptoms surface far from their true causes. Recent advances in Large Language Models (LLMs) present new opportunities to enhance automated RCA. In particular, LLM-based agents offer autonomous execution and dynamic adaptability with minimal human intervention. However, their practical value for RCA depends on the fidelity of their reasoning and decision-making. Existing work relies on historical incident corpora, operates directly on high-volume telemetry beyond current LLM capacity, or embeds reasoning inside complex multi-agent pipelines: conditions that obscure whether failures arise from the reasoning itself or from peripheral design choices.
In this thesis, we present a focused empirical evaluation that isolates an LLM’s reasoning behaviour. We design a controlled framework that foregrounds the LLM within a deliberately simplified experimental setting. We evaluate six LLMs under two agentic workflows (ReAct and Plan-and-Execute) and a non-agentic baseline on two real-world case studies (GAIA and OpenRCA). In total, we execute 48,000 simulated failure scenarios, totalling 228 days of execution time. We measure both root-cause accuracy and the quality of intermediate reasoning traces, producing a labelled taxonomy of 16 common RCA reasoning failures, with an LLM-as-a-Judge used for annotation. Our results clarify where current open-source LLMs succeed and fail in multi-hop RCA, quantify their sensitivity to input data modalities, and identify the reasoning failures that predict final correctness. Together, these contributions provide transparent, reproducible empirical results and a failure taxonomy to guide future work on reasoning-driven system diagnosis.
To attend this master’s thesis presentation in person, please go to E7 5419. You can also attend virtually on MS Teams.