Semih Salihoglu
firstname [dot] lastname [at] uwaterloo [dot] ca
DC 3351
Meeting times: Mon-Wed 9:00am
Meeting location: DC 2568
This is a project-based seminar course that aims to survey the different models that underlie modern large-scale data processing systems, such as MapReduce, Spark, Pregel, Flink, Storm, Timely, and others. The goal is to identify the fundamental advantages and limitations of different models and demonstrate the systems and applications that are built on each model. Main topics include bulk synchronous processing, asynchronous systems, streaming systems, and timely and differential dataflow. For each model, we will study its theoretical foundations, the systems built on the model, and the applications built on those systems. We will also cover two recent topics: disorderly distributed data-driven programming and coflow networking, which, despite not lying directly in data management, address two exciting challenges that arise in distributed data processing.
Distributed data processing is an exciting research area, to which systems, theory, and application researchers all make significant contributions. Our goal in this seminar will be to cover papers from all three areas. This is not a theory seminar, but we will read some theoretical papers to understand the fundamentals of these systems.
One important goal of the seminar is to cover some of the good PhD theses that were written in the area in the last 5 years. We will discuss what made those theses impactful. So, unlike some other seminars, we will sometimes be reading thesis chapters.
Students are expected to understand the fundamentals of databases, algorithms, and operating systems each at least at the level of an introductory course.
The main workload will consist of a term-long project that students will do on one of the systems covered in the seminar. The remaining workload consists of paper reviews and presentations. Detailed information on each piece is below. This is a seminar, so your participation in class is critical to everyone's learning, which is why it will be a significant part of your grade. Please ask questions and make comments throughout the discussions. The workload pieces and mark breakdown are as follows:
For each class (except the first and last), you will write a review of one of the papers assigned to that day. If more than one paper is assigned, you can pick either one. You are allowed to skip 2 reviews throughout the term. Reviews will be at most 1 page long, roughly 2 paragraphs. You are expected to (very, very) briefly answer the following 6 questions and finish your review with one question that can start a discussion in class:
Each student will do roughly 2 presentations per term. When there are 2 presenters in a day, the presentations will be 20 minutes long. If you are the only presenter that day, the presentation will be slightly longer, 25-30 minutes. Here are the important points summarizing what you have to do for your presentations.
Date | Topic | Readings | Slides |
---|---|---|---|
12/09 | Overview & BSP | BSP: A bridging model for parallel computation | Semih |
14/09 | A BSP General Data Processing System | MapReduce | Semih |
19/09 | BSP Relational Dataflow & Datawarehouse Systems | Irene (Pig), Lei Yao (Hive) | |
21/09 | BSP Graph Processing Systems | Vivi (Pegasus), Chathura (Pregel) | |
26/09 | Infrastructure 1 | Amine (Google File System), Siddharta (Bigtable) | |
28/09 | Infrastructure 2 | Slavik (RAMCloud), Slavik (Dynamo) | |
03/10 | In-memory BSP General Data Processing System | Adi | |
05/10 | Benefits of Synchronization (Project Proposals Due) | Yuwei (Upper & Lower Bounds) | |
10/10 | No class | | |
14/10 | Costs of Synchronization 1: Skew | Hao (SkewTune) | |
17/10 | Costs of Synchronization 2: Convergence; Asynchronous ML/Graph Processing | Saifuddin (Project Adam), Luke (PowerGraph) | |
19/10 | Costs of Synchronization 3-1: Latency/Throughput; Asynchronous OLTP Applications | Anil | |
24/10 | Costs of Synchronization 3-2: Latency/Throughput; Asynchronous OLTP Applications | Slavik | |
26/10 | Foundations of Streaming Systems | Weichen (CQL) | |
31/10 | Modern Distributed Streaming Systems | Anil (Storm), Saifuddin (MillWheel) | |
02/11 | Modern Streaming Infrastructures and Applications | Hao (Kafka), Luke (SAMOA) | |
07/11 | Timely Dataflow | Amine | |
09/11 | Differential Dataflow | Chathura | |
14/11 | Distributed Disorderly Data-Driven Programming 1 | Siddharta (Datalog), Vivi (Overlog) | |
16/11 | Distributed Disorderly Data-Driven Programming 2 | Irene | |
21/11 | Coflow Networking 1 | Adi | |
23/11 | Coflow Networking 2 | Yuwei | |
28/11 | TBA: Other Systems 1 | Weichen (TensorFlow) | |
30/11 | Other Systems 2 | Lei (TAO) | |
--- | Other Papers | --- | |
05/12 | Project Presentations | Pictures taken by Slavik |