The exponential growth of computing has led to a large increase in the scale of the problems addressed by computer scientists. For example, twenty years ago a few hundred megabytes constituted a substantial text corpus. By the early 1990's, efficient handling of a few gigabytes was a reasonable goal. Nowadays, search engines deployed within a corporation index terabytes of data in specialized systems, and general business practice is rapidly approaching this scale. We have seen similar growth in other large data collections, such as those derived from Internet traffic management, genomic data, and sensor networks. My research has focused on areas of application, such as these, in which the impact of an improved theoretical solution can be most felt. For example, in the context of constraint propagation [1] (see Constraint Programming below) we proposed a novel linear-time algorithm that improved on the best known Θ(n²) algorithm used in practice until then. Experimentally we determined the leading constant of the Θ(n) speedup factor to be 1/2. This means that for a problem of size 100,000, the time required to solve it went down from about a year of computation to approximately ten minutes.
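
For concreteness, the back-of-the-envelope arithmetic behind that last figure is sketched below; it is only a sanity check using the illustrative numbers quoted above, not a new measurement.

```python
# Illustrative arithmetic only: a Theta(n) speedup with leading constant 1/2,
# applied to a problem of size n = 100,000.
n = 100_000
speedup = n / 2                      # empirical speedup factor of n/2
one_year_minutes = 365 * 24 * 60     # about 525,600 minutes of computation
print(one_year_minutes / speedup)    # roughly 10.5 minutes
```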

The general goal of my research is the study of real-life applications that (i) are amenable to theoretical analysis and (ii) lead to a deeper understanding of the nature of computation and information processing. Here is a selection of the problems I work on (for a full list, see my publications):


The Internet and the World Wide Web

We study the effects of caching in the network. In 1996 we proposed a push-cache protocol for HTTP based on geographic and topological configurations. The simulations presented were later validated when similar schemes were implemented by content distribution networks (such as Akamai) on the public Internet.

Another area of study is the introduction of performance-based routing using an overlay network over the public, BGP-based Internet infrastructure. This work has various aspects, including Internet tomography and traffic balancing, and was incorporated into Internap Network Services' routing tool AS/3 (lead developer: O. Stiffelman).


Text-dominated databases/search engines

The WWW, corporate email archives, and government legislation and public policy documents are typical examples of vast text collections. Aside from text, large data collections are also generated by genomic analysis, trading, sensors, electronic transactions, and other such sources. Our goal is to propose improved methods for users to effectively search these collections and collate the results obtained. In particular, we study algorithms for fast composition of a result set from the document sets corresponding to query terms. Faster algorithms also free up resources for more intelligent processing of the result set. These sets are typically several million entries in size. We revisited algorithms for set operations, proposed an adaptive algorithm, and showed that its running time is within a constant factor of optimal with respect to a natural measure of the difficulty of an instance.
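
To illustrate the flavour of such adaptive algorithms, the sketch below intersects two sorted posting lists using galloping (doubling) searches, so that its cost depends on how the lists interleave rather than on their total length alone. It is a minimal illustration in this spirit, not the algorithm from the paper, and the function names are mine.

```python
from bisect import bisect_left

def gallop(a, x, lo):
    """Smallest index i >= lo with a[i] >= x, found by doubling a probe step
    and then binary searching inside the bracketed window."""
    step = 1
    while lo + step < len(a) and a[lo + step] < x:
        step *= 2
    return bisect_left(a, x, lo, min(lo + step + 1, len(a)))

def adaptive_intersect(a, b):
    """Intersect two sorted lists, skipping ahead adaptively with gallop()."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = gallop(a, b[j], i)   # jump past the gap in a
        else:
            j = gallop(b, a[i], j)   # jump past the gap in b
    return out

# Two toy posting lists: the work is driven by the few interleaving points.
print(adaptive_intersect([1, 3, 7, 900, 901], [2, 7, 800, 900, 1000]))  # [7, 900]
```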

Another problem confronted by search engine designers is the sheer size of the index structure. If the text is available it is in fact possible to do away with the index altogether, at a penalty in performance, which becomes linear in the size of the corpus (as with grep). In this setting we showed that any algorithm whose search time is linear in the size of the pattern (indeed, any subquadratic algorithm) requires a linear size index, even if the text is available.

Online Algorithms and Motion Planning

Online analysis is the study of algorithms that must make irrevocable decisions with only partial knowledge of the input. This framework models a variety of practical settings. A mobile robot, for example, often navigates a changing, partially known environment, such as a robot moving supplies on a shop floor from the warehouse to milling machines. Initially my research focused on solving special cases of this problem, then proceeded to extend the basic search primitives so that they apply to more general settings. One of these primitives is the linear search problem, which consists of a robot exploring two semi-infinite rays. This problem can be generalized to the exploration of m rays by a single robot, and to p robots exploring m rays. Kao et al. observed that these geometric search primitives apply to non-geometric searches, such as a set of computer heuristics searching for the solution to a computationally hard problem. Based on this idea we generalized the results on p robots and m rays to optimally solve the case of m computers working simultaneously on n problems in the context of contract algorithms.
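
The classical solution to the two-ray primitive is a doubling strategy: the searcher alternates between the rays, going twice as far on each excursion, which is 9-competitive with respect to the distance to the target. The sketch below simulates this standard strategy only; the turn points and the distance-at-least-one normalization are the usual textbook assumptions, not our m-ray or multi-robot results.

```python
def doubling_search(target_ray, target_dist):
    """Simulate the classical doubling strategy on two rays.
    Returns the total distance walked before the target is reached."""
    walked, step = 0.0, 0
    while True:
        ray = step % 2               # alternate between ray 0 and ray 1
        reach = 2.0 ** step          # turn point for this excursion
        if ray == target_ray and reach >= target_dist:
            return walked + target_dist      # found on the way out
        walked += 2 * reach          # walk out to the turn point and back
        step += 1

# The competitive ratio stays below 9 for targets at distance >= 1:
for d in (1.5, 10, 1000):
    print(max(doubling_search(r, d) / d for r in (0, 1)))
```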

Among other on-line problems, we also considered the effects of caching in the memory hierarchy. The cache-oblivious model allows programmers to reason about a two-level memory hierarchy while proving results that hold on a multilevel memory hierarchy. We found exact bounds on the time required for searching in the cache-oblivious model. Our new search structure showed that in certain settings uneven splits in a divide-and-conquer paradigm can, surprisingly, yield better performance. In future research we will study the applicability of these ideas to other settings such as cache-oblivious matrix multiplication.

It has long been known in the field of online algorithms that the standard measure of performance, namely the competitive ratio, in some cases leads to results that do not properly reflect the real-life performance of the strategy in question. Many researchers have addressed this question as it relates to paging, with partial success. Building upon the work of Albers et al., we introduced a natural measure of complexity that fully reflects the empirical separation of LRU (Least Recently Used) from all other memory paging strategies. This is the first model in which strict separation between LRU and all other strategies can be proven, thus settling a long-standing conjecture. The model extends to other on-line settings, which will be the subject of forthcoming papers.
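
For readers unfamiliar with the policy in question, the sketch below is a minimal LRU paging simulator that counts page faults on a request sequence. It only illustrates the eviction rule whose empirical superiority is discussed above, not the new complexity measure itself, and the function name is mine.

```python
from collections import OrderedDict

def lru_faults(requests, k):
    """Count page faults incurred by LRU with k page frames."""
    cache = OrderedDict()                  # keys ordered from least to most recently used
    faults = 0
    for page in requests:
        if page in cache:
            cache.move_to_end(page)        # a hit makes the page most recently used
        else:
            faults += 1
            if len(cache) == k:
                cache.popitem(last=False)  # evict the least recently used page
            cache[page] = True
    return faults

# A request sequence with locality of reference, the case where LRU shines:
print(lru_faults([1, 2, 3, 1, 2, 3, 4, 1, 2], k=3))
```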

The combined study of these various on-line problems brought about a reconsideration of the competitive ratio under an adversary model as the main technique of analysis. While this is the most common metric in on-line algorithm analysis, it produces pessimistic measures for certain types of problems, including paging algorithms and general polygonal exploration. This is an observation shared by several other researchers (see our survey on the subject). The shortcomings of the competitive ratio led us to rethink the foundations of competitive analysis. A further generalization to other settings, which we have termed the cooperative ratio, also seems a good candidate to produce measurements of online algorithms that better reflect actual practice. We are optimistic about the outcome of this research.

Constraint Programming

Constraint programming aims to solve computationally intractable problems within the best (exponential) time possible. In particular, constraint propagation aims to assist an exhaustive search by detecting as early as possible when a partial solution to a problem is no longer self-consistent. This has been the subject of intensive study in the Artificial Intelligence community and its results are widely used in practice (e.g. the ILOG solver). The alldifferent constraint is an important constraint with numerous scheduling applications. For example, devising the best schedule assigning airplanes to gates at an airport is an instance of the alldifferent constraint. Kurt Mehlhorn and Sven Thiel gave a theoretical algorithm for this problem, but the solution was complex and thus had yet to be implemented. We introduced advanced data structure techniques to speed up the computation of a solution. This opened a new avenue of attack for computing consistency of other types of constraints using similar techniques. In future research we will continue to investigate the generalization of these techniques to other types of constraints, such as linear constraints and improved propagators for the interdistance constraint. The research melds the use of known and novel data structures designed for this specific purpose with theoretical proofs of certain properties of the constraints under consideration, which allow a speedup of the search. In the case of the interdistance constraint we showed that a certain set of Θ(n³) intervals contains only O(n²) intervals of relevance, and we are hoping to improve this bound even further. This is joint work with C-G. Quimper (recent PhD graduate) and G. Pesant.
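
As a concrete illustration of what the alldifferent constraint demands, the sketch below is a simple quadratic feasibility check for variables with interval domains: sort the variables by upper bound and greedily give each one the smallest unused value that fits. This Hall-style greedy decides satisfiability only; it is not the near-linear bounds-consistency propagator discussed above, and the function name and the gate example are mine.

```python
def alldifferent_feasible(domains):
    """Can variables with integer interval domains [lo, hi] take pairwise
    distinct values?  Greedy: serve variables in order of upper bound,
    giving each the smallest unused value >= its lower bound."""
    used = set()
    for lo, hi in sorted(domains, key=lambda d: d[1]):
        v = lo
        while v in used:             # smallest free value at or above lo
            v += 1
        if v > hi:                   # no value left inside this domain
            return False
        used.add(v)
    return True

# Three airplanes and gates 1..3; the first plane may only use gate 1.
print(alldifferent_feasible([(1, 1), (1, 2), (1, 3)]))  # True
print(alldifferent_feasible([(1, 1), (1, 1), (2, 3)]))  # False: two planes need gate 1
```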