Please note: This PhD defence will take place online.
Siddhartha Sahu, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Semih Salihoğlu
Diverse applications spanning areas such as fraud detection, risk assessment, recommendations, and telecommunications process datasets characterized by entities and their relationships. Graphs naturally emerge as the most intuitive abstraction for modeling these datasets. Furthermore, many practical applications seek the capability of sharing computations across multiple snapshots of evolving graphs to efficiently perform analyses, for instance when evaluating changing road conditions in transportation networks or conducting contingency analysis on infrastructure networks. The research in this thesis is motivated by the challenge to efficiently support such applications on large-scale datasets.
Differential computation (DC) has emerged as a powerful general technique for incrementally maintaining computations across evolving datasets, even those containing arbitrarily nested loops. It is thus a promising technique that can be used to build the kind of applications that motivate this thesis. We present a study of DC that explores how it can be used to build practical data systems. In particular, this thesis addresses two challenges that impede the adoption of DC: (i) the lack of high-level interfaces that can be used to develop graph-specific applications; and (ii) scalability challenges that arise due to the general maintenance technique used by DC, making it less efficient for application-specific workloads.
The key contribution of this thesis is to show how DC can be made more practical for graph analytics systems. To address the lack of high-level interfaces, we built GraphSurge, a system that can be used to create and analyze multiple views over static graphs using a declarative programming interface. When users run graph computations on a collection of views, GraphSurge internally uses DC to share computation across all views. We develop several optimizations that improve the scalability of DC. Within GraphSurge, we identify two optimization problems, which we call collection ordering and collection splitting and present algorithms to solve these problems. These optimizations improve the runtime of GraphSurge applications by up to an order of magnitude. In the reference implementation of DC, we show two design bottlenecks in how data is indexed and processed within operators. We implement a new index and an optimization called Fast Empty Difference Verification to address the bottlenecks, that are able to improve the runtime of graph analytics workloads by up to 14x. Our work was informed by insights from a non-technical user survey that we conducted to understand how graphs are used in practice.