Hemant
Saxena,
PhD
candidate
David
R.
Cheriton
School
of
Computer
Science
We address the problem of discovering dependencies from distributed big data. Existing (non-distributed) algorithms focus on minimizing computation by pruning the search space of possible dependencies. However, distributed algorithms must also optimize data communication costs, especially in current shared-nothing settings.
To do this, we define a set of primitives for dependency discovery, which corresponds to data processing steps separated by communication barriers, and we present efficient implementations that optimize both computation and communication costs. Using real data, we show that algorithms built using our primitives are significantly faster and more communication-efficient than straightforward distributed implementations.