CS 848/858: Models and Applications of Distributed Data Processing Systems

(Fall 2016)

Semih Salihoglu
firstname [dot] lastname [at] uwaterloo [dot] ca
DC 3351
Meeting times: Mon-Wed 9:00am
Meeting location: DC 2568



This is a project-based seminar course that surveys the different models that underlie modern large-scale data processing systems, such as MapReduce, Spark, Pregel, Flink, Storm, Timely, and others. The goal is to identify the fundamental advantages and limitations of different models and demonstrate the systems and applications that are built on each model. Main topics include bulk synchronous processing, asynchronous systems, streaming systems, and timely and differential dataflow. For each model, we will study the underlying theoretical model, the systems built on it, and the applications built on those systems. We will also cover two recent topics: disorderly distributed data-driven programming and coflow networking, which, despite not lying directly in data management, address two exciting challenges that arise in distributed data processing.

Distributed data processing is an exciting research area, to which systems, theory, and application researchers all make significant contributions. Our goal in this seminar will be to cover papers from all three areas. This is not a theory seminar, but we will read some theoretical papers to understand the fundamentals of these systems.

One important goal of the seminar is to cover some of the good PhD theses that were written in the area in the last 5 years. We will discuss what made those theses impactful. So, unlike some other seminars, we will sometimes be reading thesis chapters.


Students are expected to understand the fundamentals of databases, algorithms, and operating systems each at least at the level of an introductory course.

Workload & Mark Breakdown

The main workload will consist of a term-long project that students will do on one of the systems covered in the seminar. The remaining workload consists of paper reviews and presentations. Detailed information for each piece is below. This is a seminar, so your participation in class is critical to everyone's learning, which is why it will be a significant part of your grade. Please ask questions and make comments throughout our discussions. The workload pieces and mark breakdown are as follows:


Students are encouraged to come up with their own projects. However, I am happy to suggest some projects as well. The projects are intended to be mini-research projects on a topic of the course. Example formats include (but are not limited to):

The most important thing is to pick a project on a topic you really like! You will meet with me several times during the term to discuss your progress. You have about 4 weeks to decide on the project. By October 5th, you need to submit a 2-page proposal for your project (in the paper review format).

Project Deliverables

There are three main deliverables of your project.

Paper Reviews

For each class, we will write a review of one of the papers assigned to that day (except the first and last classes). If more than one paper is assigned, you can pick any one of them. You are allowed to skip 2 reviews throughout the term. The reviews will be at most 1 page long, roughly 2 paragraphs. You are expected to (very, very) briefly answer the following 6 questions and finish your review with a question that can start a discussion in class:

The first 4 of these questions are from my PhD advisor's tips for writing introductions to technical papers. I strongly recommend that each of you read this entire document very, very carefully (probably multiple times) at some point in your graduate studies. There is no fixed format for the reviews, but I recommend single column, 1.5 spacing, 12 pt, in LaTeX. You will submit your reviews at the end of each class.


Each student will do roughly 2 presentations per term. When there are 2 presenters in a day, the presentations will be 20 minutes long. If you are the only presenter that day, then the presentation will be slightly longer, 25-30 minutes. Here are the important points summarizing what you have to do for your presentations.

Schedule and Papers (TBA)

Date | Topic | Papers and Presenters
12/09 | Overview & BSP | BSP: A bridging model for parallel computation, Semih
14/09 | A BSP General Data Processing System | MapReduce, Semih
19/09 | BSP Relational Dataflow & Datawarehouse Systems | Irene (Pig), Lei Yao (Hive)
21/09 | BSP Graph Processing Systems | Vivi (Pegasus), Chathura (Pregel)
26/09 | Infrastructure 1 | Amine (Google File System), Siddharta (Bigtable)
28/09 | Infrastructure 2 | Slavik (RAMCloud), Slavik (Dynamo)
03/10 | In-memory BSP General Data Processing System | Adi
05/10 | Benefits of Synchronization (Project Proposals Due) | Yuwei (Upper & Lower Bounds)
10/10 | No class |
14/10 | Costs of Synchronization 1: Skew | Hao (SkewTune)
17/10 | Costs of Synchronization 2: Convergence; Asynchronous ML/Graph Processing | Saifuddin (Project Adam), Luke (PowerGraph)
19/10 | Costs of Synchronization 3-1: Latency/Throughput; Asynchronous OLTP Applications | Anil
24/10 | Costs of Synchronization 3-2: Latency/Throughput; Asynchronous OLTP Applications | Slavik
26/10 | Foundations of Streaming Systems | Weichen (CQL)
31/10 | Modern Distributed Streaming Systems | Anil (Storm), Saifuddin (MillWheel)
02/11 | Modern Streaming Infrastructures and Applications | Hao (Kafka), Luke (SAMOA)
07/11 | Timely Dataflow | Amine
09/11 | Differential Dataflow | Chathura
14/11 | Distributed Unorderly Data-Driven Programming 1 | Siddharta (Datalog), Vivi (Overlog)
16/11 | Distributed Unorderly Data-Driven Programming 2 | Irene
21/11 | Coflow Networking 1 | Adi
23/11 | Coflow Networking 2 | Yuwei
28/11 | TBA: Other Systems 1 | Weichen (TensorFlow)
30/11 | Other Systems 2 | Lei (TAO)
--- Other Papers ---
05/12 | Project Presentations | Pictures taken by Slavik