Kareem El Gebaly, PhD candidate
David R. Cheriton School of Computer Science
Analyzing relational data typically involves tasks that build familiarity with the data, yield insights, and lead to findings or conclusions. This process is usually carried out by data experts, such as data scientists, who share their output with a potentially less expert audience (everyone). Our goal is to enable everyone to take part in analyzing data rather than passively consuming its outputs (analytics democratization). With the increasing availability of data on the web (data democratization), combined with already widespread personal computing capabilities, this goal is becoming more attainable. With the recent growth of public data, i.e., Open Data, non-data experts such as data journalists are keener than ever to analyze new data sets that are relevant to broad sectors of society. An important example of Open Data is the data released by governments all over the world, i.e., Open Government.
This dissertation focuses on two main challenges facing data exploration scenarios, such as data journalism, over open data found on the web. First, the infrastructure necessary for interactive data exploration is costly and hard to manage, especially in data journalism use cases where users may have no technical knowledge. Second, these users and their audiences need guidance: with so many possible starting points, they would not know where to begin exploring the data.
A common theme among many data exploration tasks is navigating the many different ways to group the data, i.e., exploring the data cube. To guide the user through data exploration, we therefore introduce an information-theoretic technique that picks the most informative parts of the entire data cube of a relational table (explanation tables). We introduce a sampling-based technique for generating explanation tables that achieves quality comparable to an exhaustive technique that considers the entire data cube, with a significant reduction in run time. In addition, we introduce optimizations that allow explanation tables to be generated under the modest resources available in the browser, without any external dependencies.
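To make the idea of scoring parts of a data cube concrete, the following is a minimal sketch and not the dissertation's algorithm. It enumerates every pattern of a tiny, made-up table (one pattern per group of every possible group-by, with `*` as a wildcard) and ranks the patterns by a simple information-theoretic score: the group size times the KL divergence between the group's outcome rate and the overall rate. The table contents, the column names, the helper names, and the scoring rule itself are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import log2

# Hypothetical toy table: (day, origin, delayed), where delayed is a binary outcome.
ROWS = [
    ("Fri", "SF", 1),
    ("Fri", "SF", 1),
    ("Fri", "London", 1),
    ("Sun", "SF", 0),
    ("Sun", "London", 0),
    ("Sun", "London", 0),
    ("Sat", "London", 0),
    ("Sat", "SF", 1),
]
N_DIMS = 2  # number of dimension columns; the last column is the outcome


def cube_patterns(rows, n_dims):
    """Yield each distinct data cube pattern once ('*' marks a wildcard)."""
    seen = set()
    for k in range(n_dims + 1):
        for kept in combinations(range(n_dims), k):
            for row in rows:
                pattern = tuple(
                    row[i] if i in kept else "*" for i in range(n_dims)
                )
                if pattern not in seen:
                    seen.add(pattern)
                    yield pattern


def matches(pattern, row):
    """True if the row agrees with the pattern on every non-wildcard column."""
    return all(p == "*" or p == v for p, v in zip(pattern, row))


def bernoulli_kl(p, q):
    """KL divergence in bits between Bernoulli(p) and Bernoulli(q), 0 < q < 1."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0:
            total += a * log2(a / b)
    return total


def rank_patterns(rows, n_dims):
    """Return (score, pattern, count, rate) tuples, most informative first."""
    overall = sum(r[-1] for r in rows) / len(rows)
    scored = []
    for pattern in cube_patterns(rows, n_dims):
        group = [r for r in rows if matches(pattern, r)]
        rate = sum(r[-1] for r in group) / len(group)
        score = len(group) * bernoulli_kl(rate, overall)
        scored.append((score, pattern, len(group), rate))
    # Sort by descending score; break ties by pattern for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored


if __name__ == "__main__":
    for score, pattern, count, rate in rank_patterns(ROWS, N_DIMS)[:3]:
        print(pattern, count, rate, round(score, 3))
```

On this toy table the two highest-scoring patterns are `(Fri, *)` and `(Sun, *)`, since each covers three rows whose outcome differs completely from the overall rate of one half. Enumerating every pattern, as this sketch does, is the kind of exhaustive pass over the cube that sampling is meant to avoid on realistic tables.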
In this dissertation, we present an SQL engine and a data exploration guidance tool that run entirely in the browser. We view the techniques and experiments presented here as a fully functional, open-source proof of the viability of our proposal. Our analytical stack is portable and works entirely in the browser. We show that SQL and exploration guidance can be as accessible as a web page, which opens the opportunity for more people to analyze data sets. Facilitating data exploration for everyone is a step towards analytics democratization, where everyone, not just the experts, can participate in data exploration.
200 University Avenue West
Waterloo, ON N2L 3G1