Please note: This master’s thesis presentation will take place online.
Abdallah Elshamy, Master’s candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Samer Al-Kiswany
Scientific workflows are typically data-intensive. They consist of many stages, each of which may contain hundreds or even thousands of tasks. Traditionally, scientific workflows have been executed using the serverful computing model. Serverless computing is an attractive alternative because it frees developers from provisioning and managing resources and offers a fine-grained billing model.
In this work, we study the viability and evaluate the tradeoffs of using the serverless paradigm to run scientific workflows. We follow an empirical approach to evaluate the performance and cost benefits of this paradigm and to study the suitability of the current serverless software stack to support complex data-intensive scientific workflows. Specifically, we discuss, implement, and evaluate three orchestration approaches for executing scientific workflows: serverful-centralized, serverless-centralized, and serverless-decentralized. This work is the first to implement and evaluate a purely serverless orchestration approach that does not require deploying a dedicated workflow manager for scientific workflows.
Our evaluation shows that serverless orchestration approaches incur a noticeable performance overhead for some workflow patterns (e.g., reduce stages) because they access a large amount of remote data. We propose two optimizations (prefetching file privileges and container placement) that exploit data locality to mitigate that overhead. Our evaluation with three scientific workflows (Montage, 1000Genomes, and SRA Search) shows that the serverless-centralized and serverless-decentralized approaches achieve performance comparable to the serverful approach. Our results also show that, for the Montage application, the prefetching file privileges and container placement optimizations improve performance by 32.6% and 44%, respectively, compared to an unoptimized version. Finally, we introduce a cost model to estimate which approach is cheaper for a given application and cloud provider. Our cost analysis shows that the answer depends on the characteristics of the workflow and the pricing of the cloud provider.
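As a rough illustration of the kind of comparison such a cost model makes, the sketch below contrasts pay-per-use billing with billing for provisioned machines. It is not the model from the thesis: the formulas, the names (Task, serverless_cost, serverful_cost, cheaper_approach), and all prices are assumptions chosen only to show why the answer depends on both the workflow and the provider's pricing.

```python
import math
from dataclasses import dataclass


@dataclass
class Task:
    duration_s: float  # execution time of one task, in seconds (hypothetical)
    memory_gb: float   # memory allocated to the function, in GB (hypothetical)


def serverless_cost(tasks, price_per_gb_s, price_per_request):
    """Assumed pay-per-use billing: GB-seconds consumed plus a per-invocation fee."""
    gb_seconds = sum(t.duration_s * t.memory_gb for t in tasks)
    return gb_seconds * price_per_gb_s + len(tasks) * price_per_request


def serverful_cost(makespan_s, num_vms, price_per_vm_hour, billing_unit_s=1.0):
    """Assumed VM billing: every VM is paid for the whole makespan,
    rounded up to the provider's billing unit."""
    billed_s = math.ceil(makespan_s / billing_unit_s) * billing_unit_s
    return num_vms * price_per_vm_hour * billed_s / 3600.0


def cheaper_approach(tasks, makespan_s, num_vms, prices):
    """Return the name of the cheaper approach and both estimated costs."""
    sl = serverless_cost(tasks, prices["gb_s"], prices["request"])
    sf = serverful_cost(makespan_s, num_vms, prices["vm_hour"])
    return ("serverless" if sl < sf else "serverful"), sl, sf


# Example with made-up numbers: 100 short tasks versus 2 VMs held for one hour.
example_tasks = [Task(duration_s=10.0, memory_gb=1.0) for _ in range(100)]
example_prices = {"gb_s": 0.00002, "request": 0.0000002, "vm_hour": 0.10}
choice, sl_cost, sf_cost = cheaper_approach(example_tasks, 3600.0, 2, example_prices)
print(choice, sl_cost, sf_cost)
```

Under these made-up prices, a workflow of short, bursty tasks favors the serverless estimate, while a workflow that keeps the machines busy for most of the makespan would shift the balance toward the serverful one.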
Bio: I am Abdallah, currently pursuing an MMath degree at the University of Waterloo under the supervision of Prof. Samer Al-Kiswany. My research focuses on bringing serverless computing to new frontiers. My thesis explores using serverless computing to execute scientific workflows and compares different orchestration approaches for doing so.