Ashvin Goel, Associate Professor, Electrical and Computer Engineering and Computer Science, University of Toronto
Databases commonly use primary-backup replication schemes for fault tolerance and disaster recovery. These schemes pose significant challenges for modern in-memory databases, which sustain high transaction rates and generate recovery logs at close to memory bandwidth. Replaying the recovery log scalably on the backup is hard, which makes the backup a bottleneck. Moreover, transferring the log can create a network bottleneck. Both of these bottlenecks can significantly slow the primary database.
This work proposes to address these problems by using record-replay to replicate fast databases. Our design allows replay to run scalably and concurrently, so that backup performance scales with the primary. At the same time, our approach needs only 15-20% of the network bandwidth that traditional logging requires, which significantly reduces network infrastructure costs.