Making Big Data Interactive with Spark | Cheriton School of Computer Science

Computer Science alumnus Matei Zaharia comes back to the University of Waterloo to discuss Spark - a single programming model for big data sets - and the industry applications of the most active project in the Apache big data ecosystem.

Talk summary:
The rapid growth in data volumes requires new computer systems that scale out across hundreds of machines. While early programming models, such as MapReduce, handled large-scale batch processing, the demands on these systems have also grown: in particular, users quickly needed to run (1) more interactive ad-hoc queries, (2) more complex multi-pass algorithms (e.g. machine learning), and (3) real-time processing on large data streams.

In this talk, we present a single programming model, resilient distributed datasets (RDDs), that supports all of these emerging workloads. RDDs form the basis of Apache Spark, an open source cluster computing system that supports real-time and sophisticated analytics on big data. Spark runs up to 100x faster than previous systems like Hadoop MapReduce, while offering clean, easy-to-use interfaces in Java, Scala and Python. Spark has quickly become the most active project in the Apache big data ecosystem, with over 100 developers contributing in the past year, and we will cover industry applications as well as the ideas behind the project.

About Matei Zaharia:

Matei Zaharia is an assistant professor at MIT and CTO at Databricks, the startup company commercializing Spark. He got his undergraduate degree at the University of Waterloo and then his PhD at UC Berkeley. While at Berkeley, Matei started Spark as a research project.

Location Information

Location Address: AL - Arts Lecture Hall
200 University Avenue West
Waterloo, ON, CA N2L 3G1

Link to map: https://uwaterloo.ca/map/