Please note: This seminar will be given online.
Ram Alagappan, Postdoctoral researcher
VMware Research Group
Distributed storage systems form the core of modern cloud services. Like much systems software, these systems are built using layering: system designers use distributed protocols (e.g., Paxos, 2PC) and layer them upon local storage engines (e.g., RocksDB, SQLite). Such layering hides details of the storage stack from the layers above, easing development.
I will show that treating the storage stack as a black box, unfortunately, masks vital information, resulting in poor reliability and missed performance opportunities. I will then demonstrate that it is greatly beneficial to expose useful information across layers of a distributed storage system (while hiding unimportant details). In particular, I will show that reliability and performance can be significantly improved by co-designing distributed systems and storage stacks.
In this talk, I will focus on reliability and first show how problems in the storage layer can lead to disastrous outcomes like data loss, corruption, and unavailability in widely used distributed storage systems. These outcomes arise due to a few fundamental issues in storage fault handling common across many systems. I then present CTRL, a new approach to recovering from storage faults in distributed systems. CTRL co-designs the storage stack and the distributed layers so that they cooperate and leverage cross-replica redundancy to perform correct recovery. I implement CTRL in two practical systems and show that it incurs negligible performance overhead while significantly improving resiliency to storage faults. Towards the end, I will briefly discuss how higher performance can be realized through a similar co-design.
Bio: Ram Alagappan is a postdoctoral researcher at the VMware Research Group. He earned his PhD working with Professors Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau at the University of Wisconsin-Madison. His work has been published at top systems venues (SOSP, OSDI, FAST, EuroSys, and ATC), invited to journals (CACM and TOS), and has won three best paper awards (FAST '17, '18, and '20). His dissertation also won an honorable mention for the UW CS Best Dissertation.
His open-source frameworks have had a significant practical impact: these tools have exposed more than 80 severe bugs across 20 different widely used systems. Ideas from his work on CTRL have been adopted by a commercial database to make it resilient to storage faults.
To join this Systems and Networking seminar on Zoom, please go to https://uwaterloo.zoom.us/j/99449197983?pwd=NDFCbHV3dWZ3T1U5eXZXTTdhTU9qZz09.