Master’s Thesis Presentation • Systems and Networking • Exploiting Zero-Entropy Data for Efficient Deduplication

Wednesday, September 10, 2025 2:00 pm - 3:00 pm EDT (GMT -04:00)

Please note: This master’s thesis presentation will take place in DC 2310.

Mu'men Al Jarah, Master’s candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Samer Al-Kiswany

As the volume of digital data continues to grow rapidly, efficient data reduction techniques, such as deduplication, are essential for managing storage and bandwidth. A key step in deduplication is file chunking, which is typically performed using Content-Defined Chunking (CDC) algorithms. While these algorithms have been studied on random data, their performance on zero-entropy data, which consists of long sequences of identical bytes, has not been explored. Zero-entropy data is common in real-world datasets and poses challenges for CDC in deduplication systems.

This thesis studies the impact of zero-entropy data on the performance of state-of-the-art CDC algorithms, both hash-based and hashless. The results show that existing algorithms, particularly hash-based ones, are inefficient at detecting and handling zero-entropy blocks, especially when these blocks are small, which reduces space savings. To address this issue, I propose ZERO (Zero-Entropy Region Optimization), a system that integrates into the deduplication pipeline. ZERO identifies and extracts zero-entropy blocks prior to chunking, compresses them using Run-Length Encoding (RLE), and stores their metadata for later reconstruction. ZERO improves deduplication space savings by up to 29% without impacting throughput.
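
To illustrate the idea of extracting zero-entropy regions before chunking, the following is a minimal Python sketch. It is not the ZERO implementation from the thesis. The function names (`extract_runs`, `reconstruct`), the `min_run` threshold of 64 bytes, and the `(offset, value, length)` metadata layout are all assumptions made for this example. The sketch scans a buffer for runs of identical bytes, records each sufficiently long run as a run-length entry, and returns the remaining bytes, which a CDC algorithm would then chunk as usual.

```python
from typing import List, Tuple

# Hypothetical metadata record: (offset in original data, byte value, run length).
Run = Tuple[int, int, int]


def extract_runs(data: bytes, min_run: int = 64) -> Tuple[bytes, List[Run]]:
    """Split data into residual bytes and run-length metadata.

    Runs of identical bytes with length >= min_run (an assumed threshold)
    are removed from the stream and recorded as (offset, value, length).
    """
    residual = bytearray()
    runs: List[Run] = []
    i, n = 0, len(data)
    while i < n:
        # Find the end of the run of identical bytes starting at i.
        j = i + 1
        while j < n and data[j] == data[i]:
            j += 1
        if j - i >= min_run:
            runs.append((i, data[i], j - i))  # run-length encode the region
        else:
            residual += data[i:j]  # short run: leave it for the chunker
        i = j
    return bytes(residual), runs


def reconstruct(residual: bytes, runs: List[Run]) -> bytes:
    """Rebuild the original data from the residual bytes and the run metadata."""
    out = bytearray()
    pos = 0  # read position in residual
    for offset, value, length in runs:  # runs are in ascending offset order
        gap = offset - len(out)  # residual bytes that precede this run
        out += residual[pos:pos + gap]
        pos += gap
        out += bytes([value]) * length
    out += residual[pos:]  # trailing residual bytes after the last run
    return bytes(out)
```

In this sketch only the residual bytes would be handed to the chunker, while the run metadata is kept alongside so that `reconstruct` can restore the original byte stream exactly. A real system would also need to choose the threshold carefully and persist the metadata, which this example does not cover.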