Fast String Sorting using Order Preserving Compression
[.ps]
[.pdf]
Abstract
We give experimental evidence for the benefits of order preserving compression in sorting algorithms. While in general any algorithm might benefit from compressed data due to reduced paging requirements, we identified two natural candidates that would further benefit from order preserving compression, namely string-oriented sorting algorithms and word-RAM algorithms for keys of bounded length. The word-RAM model has some of the fastest known sorting algorithms in practice. These algorithms are designed for keys of bounded length, usually 32 or 64 bits, which limits their direct applicability for strings. One possibility is to use an order preserving compression scheme, so that a bounded-key-length algorithm can be applied. For the case of standard algorithms we took what is considered to be the among the fastest non-word RAM string sorting algorithms, Fast MKQSort, and measured its performance on compressed data. The Fast MKQSort algorithm of Bentley and Sedgewick is optimized to handle text strings. Our experiments show that order compression techniques results in savings of approx. 15% over the same algorithm on non-compressed data. For the word-RAM we modified Andersson's sorting algorithm to handle variable length keys. The resulting algorithm is faster than the standard Unix sort by a factor of 1.5x. Lastly we used an order preserving scheme that is within a constant additive term of the optimal Hu-Tucker but requires linear time rather than O(m log m) where m is the size of the alphabet.
Bibtex Entry