Current Research

Research Overview

The main thrust of my current research deals with space efficient data structures. For example, several years ago we were interesting in developing highly efficient techniques for preprocessing large text files (in particular the Oxford English Dictionary). A tree based method was developed that permitted one to find (references to) all occurrences of a given phrase in what amounted to a scan of the phrase (independent of the size of the document). Unfortunately, the index was 4 or 5 times the size of the original file, and this was unacceptable. Most of this space was for pointers in a binary trie (or tree); for each byte of the original file were two pointers in the representation of the tree. There are about 4ⁿ trees on n nodes, so, at least in principle, one needs only 2n bits to represent an arbitrary tree, … not 2n pointers. The issue is that one requires a succinct representation that permits finding the children of a given node quickly. Several of my papers have focussed on this and related issues. Indeed we have found (practical) tree representations that take (virtually) the information theoretic minimum space and permit the navigational operations find_parent, find_child, find_subtree_size, etc. in constant time.

Another line of work is motivated by queries such as unions and intersections of several relations in large databases. The premise is that one would like to avoid looking at all the data, … even avoid looking at all of the answer. The expectation is that these sets are represented by B-trees or similar structures. The goal is to produce the results of the operations in the same representations as the inputs as efficiently as possible, and to prove one cannot do better.

Throughout this work issues of memory management keep coming up. Indeed another line of algorithmic work deals with this issue explicitly. Recent papers have dealt with growing arrays and vectors (i.e. new elements can be put anywhere, not just at the end) that can be accessed quickly without much overhead of extra memory.