Friday, July 4, 2014 2:00 pm - 2:00 pm EDT (GMT -04:00)
Computer Science alumnus Matei Zaharia comes back to the University of Waterloo to discuss Spark - a single programming model for big data sets - and the industry applications of the most active project in the Apache big data ecosystem.
Talk summary:
The rapid growth in data volumes requires new computer systems that scale out across hundreds of machines. While early programming models, such as MapReduce, handled large-scale batch processing, the demands on these systems have also grown: in particular, users quickly needed to run (1) more interactive ad-hoc queries, (2) more complex multi-pass algorithms (e.g. machine learning), and (3) real-time processing on large data streams.
In this talk, we present a single programming model, resilient distributed datasets (RDDs), that supports all of these emerging workloads. RDDs form the basis of Apache Spark, an open source cluster computing system that supports real-time and sophisticated analytics on big data. Spark runs up to 100x faster than previous systems like Hadoop MapReduce, while offering clean, easy-to-use interfaces in Java, Scala and Python. Spark has quickly become the most active project in the Apache big data ecosystem, with over 100 developers contributing in the past year, and we will cover industry applications as well as the ideas behind the project.
About Matei Zaharia:
Matei Zaharia is an assistant professor at MIT and CTO at Databricks, the startup company commercializing Spark. He earned his undergraduate degree at the University of Waterloo and his PhD at UC Berkeley. While at Berkeley, Matei started Spark as a research project.