Milad (Enayatallah) Ghaznavi, PhD candidate
David R. Cheriton School of Computer Science
Traffic in enterprise networks typically traverses a sequence of middleboxes forming a service function chain, or simply a chain. Tolerating failures along a chain is imperative to the availability and reliability of enterprise applications: service outages due to chain failures severely impact customers and cause significant financial losses. Making a chain fault-tolerant is challenging because, when a failure occurs, the state of the faulty middleboxes must be recovered correctly and quickly while the chain continues to provide high throughput and low latency.
We present FTC, a novel system design and protocol for fault-tolerant service function chaining that guarantees strong consistency with up to f middlebox failures for chains of length f+1 or longer, without requiring dedicated replica nodes. FTC uses the notion of transactional packet processing to retain consistent information during the normal operation of a chain, so that state can be recovered correctly after a failure. Our protocol piggybacks this information on packets and replicates it at other on-path middleboxes, taking advantage of the chain's natural structure. We implement and evaluate a prototype of FTC. Our results show that FTC can achieve 9.5 Mpps and adds only 6–13% throughput overhead and 20 μs latency overhead per middlebox for chains of 2–5 middleboxes. Our system recovers lost state in ~271 ms in a distributed cloud.
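The piggybacking idea can be illustrated with a toy model. The sketch below is not FTC's implementation: the `Middlebox` class, the `send_packet` and `recover` helpers, the per-flow packet counter used as middlebox state, and the choice to wrap leftover updates around to the head of the chain are all assumptions made for illustration. It only shows how state updates carried on a packet could be copied at f other on-path middleboxes, so that a chain of at least f+1 middleboxes needs no dedicated replica nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Middlebox:
    name: str
    state: dict = field(default_factory=dict)     # state owned by this middlebox
    replicas: dict = field(default_factory=dict)  # owner name -> copy of that owner's state

    def apply_updates(self, updates):
        """Store piggybacked updates; return those that still need more copies."""
        pending = []
        for owner, key, value, copies_left in updates:
            self.replicas.setdefault(owner, {})[key] = value
            if copies_left > 1:
                pending.append((owner, key, value, copies_left - 1))
        return pending

    def process(self, flow, f):
        """Toy packet transaction: count packets per flow, emit one state update."""
        self.state[flow] = self.state.get(flow, 0) + 1
        return (self.name, flow, self.state[flow], f)


def send_packet(chain, flow, f):
    """Pass one packet through the chain, replicating each update f times."""
    if len(chain) < f + 1:
        raise ValueError("need at least f + 1 middleboxes to tolerate f failures")
    piggyback = []  # state updates carried along with the packet
    for mb in chain:
        piggyback = mb.apply_updates(piggyback)
        update = mb.process(flow, f)
        if f > 0:
            piggyback.append(update)
    # Assumption of this sketch: updates that still need copies at the end
    # of the chain wrap around to the middleboxes at its head.
    for mb in chain:
        if not piggyback:
            break
        piggyback = mb.apply_updates(piggyback)


def recover(chain, failed_name):
    """Rebuild a failed middlebox's state from any surviving replica."""
    for mb in chain:
        if mb.name != failed_name and failed_name in mb.replicas:
            return dict(mb.replicas[failed_name])
    raise RuntimeError("no surviving replica of " + failed_name)


chain = [Middlebox("firewall"), Middlebox("nat"), Middlebox("monitor")]
for flow in ["a", "a", "b"]:
    send_packet(chain, flow, f=1)
recovered = recover(chain, "nat")  # state of "nat", read from its on-path replica
```

In this model every middlebox acts both as a packet processor and as a replica for its neighbours, which is the property the abstract relies on. The sketch ignores concurrency, packet loss, and ordering, all of which a real protocol would have to handle.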