Please note: This PhD defence will be given online.
Sajjad Rizvi, PhD candidate
David R. Cheriton School of Computer Science
Supervisors: Professors Bernard Wong, Srinivasan Keshav
Big data applications put significant latency and throughput demands on distributed storage systems. Meeting these demands requires storage systems to use a significant amount of infrastructure resources, such as network capacity and storage devices. Resource demands largely depend on the workloads and can vary significantly over time. Moreover, demand hotspots can move rapidly between different infrastructure locations.
Existing storage systems are largely infrastructure-oblivious as they are designed to support a broad range of hardware and deployment scenarios. Most only use basic configuration information about the infrastructure to make important placement and routing decisions. In the case of cloud-based storage systems, cloud services have their own infrastructure-specific limitations, such as minimum request sizes and maximum number of concurrent requests. By ignoring infrastructure-specific details, these storage systems are unable to react to resource demand changes and may have additional inefficiencies from performing redundant network operations. As a result, provisioning enough resources for these systems to address all possible workloads and scenarios would be cost-prohibitive.
This thesis studies the performance problems in commonly used distributed storage systems and introduces novel infrastructure-aware design methods to improve their performance. First, it addresses the problem of slow reads due to network congestion that is induced by disjoint replica and path selection. Selecting a read replica separately from the network path can perform poorly if all paths to the pre-selected endpoints are congested. Second, this thesis looks at scalability limitations of consensus protocols that are commonly used in geo-distributed key-value stores and distributed ledgers. Due to their network-oblivious designs, existing protocols redundantly communicate over highly oversubscribed WAN links, which utilizes network resources poorly and limits consistent replication at large scale. Finally, this thesis addresses the need for a cloud-specific realtime storage system for capital market use cases. Public cloud infrastructures provide feature-rich and cost-effective storage services. However, existing realtime timeseries databases are not built to take advantage of cloud storage services. Therefore, they do not effectively utilize cloud services to provide high performance while minimizing deployment cost.
This thesis presents three systems that address these problems by using infrastructure-aware design methods. Our performance evaluation of these systems shows that infrastructure-aware design is highly effective in improving the performance of large-scale distributed storage systems.
200 University Avenue West
Waterloo, ON N2L 3G1