Ashvin Goel, Associate Professor
Electrical and Computer Engineering and Computer Science, University of Toronto
Databases commonly use primary-backup replication schemes for fault tolerance and disaster recovery. These schemes raise significant challenges for modern in-memory databases, which sustain high transaction rates and generate recovery logs at close to memory bandwidth. It is hard to replay the recovery log scalably on the backup, which makes the backup a bottleneck. Moreover, transferring the log can create a network bottleneck. Both of these bottlenecks can significantly slow the primary database.
This work proposes addressing these problems by using record-replay to replicate fast databases. Our design enables replay to be performed scalably and concurrently, allowing backup performance to scale with that of the primary. At the same time, our approach requires only 15-20% of the network bandwidth required by traditional logging, significantly reducing network infrastructure costs.