Cheriton School of Computer Science researchers study catastrophic effects network failures have on cloud-computing systems

Tuesday, December 11, 2018

Computer clusters power everything from Google and Facebook to online retail and banking. They are composed of hundreds or even thousands of machines connected by networks, typically in a vast data centre.

"It's a complex system," says Cheriton School of Computer Science Professor Samer Al-Kiswany. "In a big cluster, you have thousands of computers connected by hundreds of switches and routers (switching devices that forward traffic between the nodes) and you have complex software to manage the switches and routers. Systems need to be able to tolerate not just node failure, but network failure as well."

photo of Hatem Takruri, Ahmed Alquraan, Mohammed Alfatafta and Samer Al-Kiswany

L to R: Graduate students Hatem Takruri, Ahmed Alquraan and Mohammed Alfatafta with Professor Samer Al-Kiswany, whose research group at the Cheriton School of Computer Science models different types of network partitioning failures.

Given their complexity, it may not be surprising that network failures do happen. A network can split such that one part of the cluster cannot communicate with the other part, a failure known as a network partition.

Reports from Google, Microsoft and Hewlett-Packard show that network partitioning failures are common, happening as often as once every four days, and that they contribute significantly to system downtime. But programmers who build software services on large clusters often assume the network is reliable, and that if it does fail the impact on services will be minor, perhaps just a brief loss of availability.

To better understand the nature and impact of network partitioning, Ahmed Alquraan, a recent master's graduate in Al-Kiswany's group, conducted an in-depth study of 136 network-partitioning failures from 25 widely used distributed systems to answer three key questions: What is the impact of network-related failures? What are the characteristics of these failures? And what changes to current design, testing and deployment practices are necessary to improve fault tolerance?

"The majority of failures lead to catastrophic effects: data loss, reappearance of deleted data, system crashes, data corruption and broken locks, which can allow double access," said Alquraan, who begins PhD studies in January 2019. "Normally, two tellers cannot modify a person's bank account at the same time because that would corrupt the account's value. But under network partitions, double access is possible."
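The bank-teller scenario can be made concrete with a minimal sketch (not the study's code): under a partition, a lock service may grant the same account lock to two clients, and their concurrent read-modify-write updates corrupt the balance through a lost update.

```python
# Illustrative only: both "tellers" believe they hold the lock, so both
# read the balance before either writes back.

def run_double_access(start=100, deposit_a=50, deposit_b=30):
    balance = start
    a = balance + deposit_a   # teller A computes 150 from the original balance
    b = balance + deposit_b   # teller B computes 130 from the same stale read
    balance = a               # A writes 150
    balance = b               # B overwrites with 130; A's deposit is lost
    return balance

print(run_double_access())    # 130, not the correct 180
```

With a working lock, teller B would have read A's updated balance of 150 and written 180; double access silently drops one of the two deposits.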

"The consequences of network partition failure depend on the system and the type of partition," said master's student Mohammed Alfatafta, one of the study's researchers in Al-Kiswany's group. "We found that the majority of production systems, the kind of systems used by banks and large companies, cannot tolerate these failures. The results could be as significant as data loss or reads of values that are not up to date. These are dangerous failures that cause real problems."

"We also identified a special type of network partitioning, the partial partition, in which some nodes cannot communicate with each other while the rest of the cluster can still communicate with both disconnected nodes," Alquraan added. "We found that partial partitions are poorly understood and tested in systems. Further research is needed to better understand how this type of fault happens and to build effective fault tolerance into a system."
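A partial partition can be modelled in a few lines; the node names below are hypothetical and only illustrate the definition, in which the failure is invisible to most of the cluster.

```python
# Illustrative model: nodes "a" and "b" cannot reach each other, yet every
# other node can still reach both, so the cluster as a whole looks healthy.

nodes = {"a", "b", "c", "d", "e"}
severed = {frozenset({"a", "b"})}   # the one broken pair

def can_reach(x, y):
    """Direct connectivity given the severed pairs."""
    return x == y or frozenset({x, y}) not in severed

# Node "c" sees both sides as healthy ...
assert can_reach("c", "a") and can_reach("c", "b")
# ... while "a" and "b" disagree about each other's liveness.
assert not can_reach("a", "b")
```

This is what makes partial partitions subtle: a monitor running on "c" observes no failure at all, even though "a" and "b" may each conclude the other has crashed.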

To address this gap, master's student Hatem Takruri, also in Al-Kiswany's group, built a testing framework called NEAT, short for network partitioning testing framework, that can help developers test the resiliency of their systems to network partitioning failures.

"NEAT deliberately splits the network between specific nodes so we can see what the result will be," Takruri said. "We used NEAT to test seven systems and we found 32 failures that caused data loss, reappearance of deleted data, system unavailability, double locking and broken locks."
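NEAT's internals are not shown in the article, but the core idea of such a harness can be sketched as a message layer that a test instructs to sever specific node pairs, run a workload, and then heal the fault. Every class and method name below is invented for illustration and is not NEAT's actual interface.

```python
# Hedged sketch of a partition-injecting test harness.

class FaultyNetwork:
    def __init__(self):
        self.severed = set()   # node pairs that currently cannot talk

    def partition(self, x, y):
        self.severed.add(frozenset({x, y}))

    def heal(self, x, y):
        self.severed.discard(frozenset({x, y}))

    def send(self, src, dst, msg, inboxes):
        # Deliver the message only if the pair is not partitioned.
        if frozenset({src, dst}) in self.severed:
            return False
        inboxes.setdefault(dst, []).append(msg)
        return True

net, inboxes = FaultyNetwork(), {}
net.partition("n1", "n2")
assert not net.send("n1", "n2", "put k=v", inboxes)  # dropped by the fault
assert net.send("n1", "n3", "put k=v", inboxes)      # rest of cluster is fine
net.heal("n1", "n2")
assert net.send("n1", "n2", "put k=v", inboxes)      # partition healed
```

A test then checks whether the system under test preserved its guarantees (no lost writes, no stale reads, no double-granted locks) across the partition-and-heal sequence.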

The study concludes with a list of common pitfalls, highlights common flaws in current designs, and presents a set of recommendations for system designers, developers, testers and network administrators. The team's study was presented at the USENIX Symposium on Operating Systems Design and Implementation (OSDI '18), held in Carlsbad, California, from October 8–10, 2018.


To learn more about this research, please see Ahmed Alquraan, Hatem Takruri, Mohammed Alfatafta, and Samer Al-Kiswany, "An Analysis of Network-Partitioning Failures in Cloud Systems," Proceedings of the Symposium on Operating Systems Design and Implementation, October 2018.