PhD Seminar • Systems and Networking • Distributed DNN Training on Serverless Resources

Friday, July 15, 2022 2:00 pm - 3:00 pm EDT (GMT -04:00)

Please note: This PhD seminar will be given online.

Runsheng (Benson) Guo, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Khuzaima Daudjee

Deep neural networks (DNNs) are often trained in parallel on a cluster of virtual machines (VMs) to reduce training time. However, this requires explicit cluster management, which is cumbersome and often results in costly over-provisioning of resources. Training DNNs on serverless compute is an attractive alternative that is receiving growing interest. In a serverless environment, cluster management is handled for the user, compute resources can be scaled at a fine-grained level, and users are billed only for the resources they consume. Despite these advantages, existing serverless systems for DNN training are ineffective because they are limited to CPU-based training and bottlenecked by expensive distributed communication.

I will present Hydrozoa, a serverless system we have developed for distributed DNN training. Hydrozoa overcomes the limitations of existing serverless DNN training with a novel architecture that combines serverless containers with hybrid-parallel training and supports dynamic worker scaling, which helps improve statistical training efficiency. Hydrozoa achieves significant throughput-per-dollar improvements over existing VM-based and serverless training approaches while relieving the user of the burden of managing machine clusters.


Bio: Runsheng (Benson) Guo is a PhD candidate whose research interests are in distributed ML training, serverless computing, and cloud computing.