Making Big Data Interactive with Spark

Friday, July 4, 2014 2:00 pm - 2:00 pm EDT (GMT -04:00)

Computer Science alumnus Matei Zaharia comes back to the University of Waterloo to discuss Spark - a single programming model for big data sets - and the industry applications of the most active project in the Apache big data ecosystem.

Talk summary:
The rapid growth in data volumes requires new computer systems that scale out across hundreds of machines. While early programming models, such as MapReduce, handled large-scale batch processing, the demands on these systems have also grown: in particular, users quickly needed to run (1) more interactive ad-hoc queries, (2) more complex multi-pass algorithms (e.g. machine learning), and (3) real-time processing on large data streams.

In this talk, we present a single programming model, resilient distributed datasets (RDDs), that supports all of these emerging workloads. RDDs form the basis of Apache Spark, an open source cluster computing system that supports real-time and sophisticated analytics on big data. Spark runs up to 100x faster than previous systems like Hadoop MapReduce, while offering clean, easy-to-use interfaces in Java, Scala and Python. Spark has quickly become the most active project in the Apache big data ecosystem, with over 100 developers contributing in the past year, and we will cover industry applications as well as the ideas behind the project.

About Matei Zaharia:
Matei Zaharia is an assistant professor at MIT and CTO at Databricks, the startup company commercializing Spark. He earned his undergraduate degree at the University of Waterloo and his PhD at UC Berkeley, where he started Spark as a research project.