Please note: This PhD seminar will take place in DC 1302.
Soheil Johari, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Raouf Boutaba
Root cause analysis (RCA) is a critical capability for operating modern distributed and networked systems (e.g., cloud platforms), where faults propagate across tightly coupled components and manifest as correlated telemetry anomalies. Recent RCA methods based on causal discovery provide interpretable dependency structures, but they often fail in practice due to two fundamental challenges: limited fault-time monitoring data, and incomplete observability of system components.
We propose a novel causal RCA framework that explicitly addresses these challenges by integrating structural priors with latent cause inference. The key novelty lies in decoupling structural learning from fault-time diagnosis and leveraging this structure to guide inference under data scarcity. Specifically, we first learn a Partial Ancestral Graph (PAG) from abundant normal-operation telemetry, capturing stable causal dependencies while encoding uncertainty due to latent confounders. At fault time, this learned structure constrains the causal search space, significantly improving robustness and statistical reliability with limited data samples. In addition, we introduce a latent root cause reasoning mechanism that systematically elevates hidden candidates when observed anomaly patterns indicate unobserved common causes, enabling RCA beyond the monitored system boundary.
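The idea of letting a pre-learned structure narrow the fault-time search can be illustrated with a toy sketch. It is not the framework's actual algorithm: it assumes the learned graph has already been reduced to definite directed edges (`parents`) and bidirected edges that mark possible latent confounders (`bidirected`), and every function and variable name is hypothetical. An anomalous node with no anomalous ancestor is kept as an observed candidate, while a pair of such nodes joined by a bidirected edge is elevated to a latent candidate.

```python
def ancestors(node, parents):
    """Return every node that reaches `node` through directed edges."""
    seen, stack = set(), list(parents.get(node, ()))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, ()))
    return seen


def rank_candidates(parents, bidirected, anomalous):
    """Toy candidate selection constrained by a prior graph (assumed form).

    parents:    dict mapping a node to its direct causes (directed edges)
    bidirected: iterable of node pairs that may share a hidden common cause
    anomalous:  nodes flagged as anomalous at fault time
    """
    anomalous = set(anomalous)
    # Anomalous nodes whose anomaly is not explained by an anomalous ancestor.
    roots = {n for n in anomalous if not (ancestors(n, parents) & anomalous)}
    # Two unexplained nodes sharing a bidirected edge suggest a hidden cause.
    latent = sorted(
        tuple(sorted(pair)) for pair in bidirected if set(pair) <= roots
    )
    covered = {n for pair in latent for n in pair}
    observed = sorted(roots - covered)
    return observed, latent


# Hypothetical graph: a -> b -> c, plus d <-> e (possible latent confounder).
parents = {"b": ["a"], "c": ["b"]}
bidirected = [("d", "e")]
observed, latent = rank_candidates(parents, bidirected, ["b", "c", "d", "e"])
print(observed)  # ['b']
print(latent)    # [('d', 'e')]
```

Here `c` is dropped because its anomaly is explained by the upstream anomaly at `b`, and `d` and `e` are attributed to a shared unobserved cause rather than reported as two separate root causes.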
We evaluate the proposed framework on two public benchmarks: a dataset emulating an end-to-end virtualized 5G network and the Sock-Shop microservice benchmark. Experimental results demonstrate improvements of up to 35% in Top-3 and 33% in Top-5 RCA accuracy over state-of-the-art approaches, along with a 4× reduction in computational cost. Moreover, the framework successfully identifies root causes that are not directly observable by leveraging latent structures and tracing their effects through observable proxies.
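The announcement does not define Top-k RCA accuracy. The sketch below assumes the common definition, namely the fraction of fault cases whose true root cause appears among the first k ranked candidates. The function name and the sample rankings are illustrative only and are not taken from the evaluation.

```python
def top_k_accuracy(rankings, truths, k):
    """Fraction of fault cases whose true root cause is in the top k candidates.

    rankings: one ranked candidate list per fault case, best candidate first
    truths:   the true root cause of each fault case, in the same order
    """
    if len(rankings) != len(truths) or not truths:
        raise ValueError("rankings and truths must be non-empty and aligned")
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)


# Made-up rankings for three fault cases.
rankings = [
    ["a", "b", "c", "d"],
    ["x", "y", "z", "w"],
    ["p", "q", "r", "s"],
]
truths = ["c", "w", "p"]

for k in (1, 3, 5):
    print(k, top_k_accuracy(rankings, truths, k))
```

Under this definition, Top-5 accuracy can never be lower than Top-3 accuracy for the same method, since every Top-3 hit is also a Top-5 hit.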