Cheriton School of Computer Science researchers study catastrophic effects network failures have on cloud-computing systems

Computer clusters power everything from Google and Facebook to online retail and banking. They consist of hundreds or even thousands of machines connected by networks, typically in a vast data centre.

“It’s a complex system,” says Cheriton School of Computer Science Professor Samer Al-Kiswany. “In a big cluster, you have thousands of computers connected by hundreds of switches and routers — switching devices that forward traffic between the nodes — and you have complex software to manage the switches and routers. Systems need to be able to tolerate not just node failure, but network failure as well.”


L to R: Graduate students Hatem Takruri, Ahmed Alquraan, Mohammed Alfatafta and Professor Samer Al-Kiswany from the Cheriton School of Computer Science’s Systems and Networking research group model different types of network partitioning failures.

Given the complexity of these systems, it is not surprising that network failures happen. In a network partition, the network splits so that one part of the cluster cannot communicate with the other part.

Reports from Google, Microsoft and Hewlett-Packard show that network partitioning failures are common, happening as often as once every four days, and that they contribute significantly to system downtime. But programmers who build software services that use large clusters often assume that the network is reliable and that, if it does fail, the impact on services will be minor, leading perhaps to just a brief loss of availability.

To better understand the nature and impact of network partitioning, Ahmed Alquraan, a recent master’s graduate in Al-Kiswany’s group, conducted an in-depth study of 136 network-partitioning failures from 25 widely used distributed systems to answer three key questions: What’s the impact of network-related failures? What are the characteristics of these failures? And what changes to the current designs, testing and deployment practices are necessary to improve fault tolerance?

“The majority of failures lead to catastrophic effects: data loss, reappearance of deleted data, system crashes, data corruption and broken locks, which can allow double access,” said Alquraan, who will be a Cheriton Graduate Scholarship holder beginning PhD studies in January 2019. “Normally, two tellers cannot modify a person’s bank account at the same time because that would corrupt the account’s value. But under network partitions, double access is possible.”

“The consequences of network partition failure depend on the system and the type of partition,” said master’s student Mohammed Alfatafta, one of the study’s researchers in Al-Kiswany’s group. “We found that the majority of production systems — the kind of systems used by banks and large companies — cannot tolerate these failures. The results could be as significant as data loss or reads of values that are not up to date. These are dangerous failures that cause real problems.”

“We also identified a special type of network partitioning — partial partitions — where some nodes cannot talk to some nodes, but the rest of the cluster can communicate with the two disconnected nodes,” Alquraan added. “We found that partial partitions are poorly understood and tested in systems. Further research is needed to better understand how this type of fault happens and to build effective fault-tolerance into a system.”
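The difference between a complete and a partial partition can be made concrete with a small sketch. The code below is only an illustration, not part of the study: the `classify` function, the node names, and the idea of listing blocked node pairs are all assumptions made for this example. It treats the cluster as a graph of working links. If some node can no longer reach every other node, even indirectly, the partition is complete; if some pairs are cut off from each other but every node is still reachable through a third node, the partition is partial.

```python
def reachable(nodes, blocked, start):
    """Return the set of nodes reachable from `start` over working links."""
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for other in nodes:
            # A link works unless the (unordered) pair is listed as blocked.
            if other not in seen and frozenset((current, other)) not in blocked:
                seen.add(other)
                stack.append(other)
    return seen


def classify(nodes, blocked_pairs):
    """Classify a fault as 'none', 'complete' or 'partial' (hypothetical helper)."""
    blocked = {frozenset(pair) for pair in blocked_pairs}
    if not blocked:
        return "none"
    # If any node is unreachable from the first node, the cluster is split in two.
    if reachable(nodes, blocked, nodes[0]) != set(nodes):
        return "complete"
    # Otherwise some pairs cannot talk directly, but the cluster is still connected.
    return "partial"


cluster = ["a", "b", "c"]
print(classify(cluster, [("a", "b")]))              # a and b are cut off, c sees both
print(classify(cluster, [("a", "b"), ("a", "c")]))  # a is isolated from everyone
```

In the first call, nodes `a` and `b` cannot talk to each other while `c` can still reach both, which matches the partial partition Alquraan describes. In the second call, `a` is cut off from the whole cluster.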

To help close this testing gap, master’s student Hatem Takruri, also in Al-Kiswany’s group, built a testing framework called NEAT (short for network partitioning testing framework) that can help developers test the resiliency of their systems to network partitioning failures.

“NEAT deliberately splits the network between specific nodes so we can see what the result will be,” Takruri said. “We used NEAT to test seven systems and we found 32 failures that caused data loss, reappearance of deleted data, system unavailability, double locking and broken locks.”
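To show the kind of failure such a test can expose, here is a toy simulation of a broken lock. It does not use NEAT or reproduce its interface; `ToyLockService`, `Replica`, and `run_scenario` are hypothetical names invented for this sketch, and the lock service is deliberately naive. Two replicas each track who holds a lock and copy that information to each other only while the network is healthy. When the link between them is cut, each side grants the same lock to a different client.

```python
class Replica:
    """One lock server with its own local view of the lock holder."""

    def __init__(self, name):
        self.name = name
        self.holder = None  # client that holds the lock, as far as this replica knows


class ToyLockService:
    """A deliberately naive two-replica lock service (illustration only)."""

    def __init__(self):
        self.replicas = {"r1": Replica("r1"), "r2": Replica("r2")}
        self.partitioned = False  # True simulates a cut link between r1 and r2

    def acquire(self, replica_name, client):
        local = self.replicas[replica_name]
        if local.holder is not None:
            return False  # this replica already believes the lock is taken
        local.holder = client
        if not self.partitioned:
            # Replication reaches the other replica only on a healthy network.
            for replica in self.replicas.values():
                replica.holder = client
        return True


def run_scenario(partitioned):
    """Two clients ask different replicas for the same lock."""
    service = ToyLockService()
    service.partitioned = partitioned
    first = service.acquire("r1", "alice")
    second = service.acquire("r2", "bob")
    holders = {r.holder for r in service.replicas.values() if r.holder is not None}
    return first, second, holders


print(run_scenario(partitioned=False))  # second request is refused
print(run_scenario(partitioned=True))   # both requests succeed: double locking
```

A partition-tolerant design would refuse to grant the lock when a replica cannot confirm the grant with a majority of replicas. The toy service skips that check, which is why the partitioned scenario ends with two lock holders.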

The study concludes with a list of common pitfalls, highlights common flaws in current designs, and presents a set of recommendations for system designers, developers, testers and network administrators. The team’s study was presented at the 13th USENIX Symposium on Operating Systems Design and Implementation, which was held in Carlsbad, California, from October 8–10, 2018.

To learn more about this research, please see Ahmed Alquraan, Hatem Takruri, Mohammed Alfatafta, and Samer Al-Kiswany, An Analysis of Network-Partitioning Failures in Cloud Systems, Proceedings of the Symposium on Operating Systems Design and Implementation, October 2018.