PLEASE NOTE: THIS TALK IS CANCELLED
Siddhartha Sahu, PhD candidate
David R. Cheriton School of Computer Science
Developers of a variety of applications, such as fraud and threat detection, risk assessment, and recommendations, naturally model the connected data as graphs. Many of these applications require the ability to analyze different snapshots or views of a large-scale graph. Broadly, the views these applications require can be divided into two categories: (i) aggregate views that aggregate multiple nodes and edges into aggregate nodes and edges that are informally called supernodes and superedges; and (ii) filtered views that filter a subset of the edges and nodes in the original graph based on some predicates. Many of these applications construct many, sometimes thousands of, views of the same graph, and compute the same computation across each view. As such, there is an opportunity for sharing of computation in both during the construction of these views and when the same analytics computation is executed over a set of views.
Our first contribution is the design and implementation of a new open-source system we call GraphSurge, which treats graph views as first-class citizens and supports a wider range of view-based applications than prior work. GraphSurge's view definition language (GVDL), supports two organizational structures: (i) aggregate cubes and (ii) multidimensional differential cubes, which is the second contribution of our work and the focus of this paper. Our aggregate and differential cubes enable computation sharing both when constructing multiple views and when sharing computation across views.
We have developed GraphSurge on top of the Timely Dataflow system and its Differential Dataflow layer. Differential Dataflow is a system for general dataflow processing based on the differential computation model, which allows sharing of computation for evolving datasets. Users write analytics computations in GraphSurge as Differential Dataflow programs. When running the same computation on views that are organized as differential cubes, GraphSurge represents the views as partially ordered versions of a single dataset, and uses Differential Dataflow's differential computation capability to efficiently share computation.