Information for

2006 technical reports

Title SI-WF 2Q: WF2Q Approximation with Small Constant Execution Overhead - Extended Version
Authors Martin Karsten
Abstract This paper presents a novel priority queue data structure and access operations for timer maintenance in the context of traffic shaping and scheduling in packet networks. The data structure and operations are used to construct an efficient General Processor Sharing (GPS) approximation scheduler. In contrast to existing proposals, this scheduler has constant and near-optimal delay and fairness properties, /and/ can be implemented with bounded amortized O(1) algorithmic complexity, /and/ has very low absolute execution overhead. The paper presents the data structure, the scheduling algorithm, and studies its execution complexity. The scheduling properties are analyzed and shown to relate nicely to existing work. Some illustrative simulation results are presented to reaffirm those properties.
Date January 2006
Report SI-WF 2Q: WF2Q Approximation with Small Constant Execution Overhead - Extended Version (PDF)
Title An Investigation on Sieve and Detour Effects Affecting the Interaction of Infrared Radiation with Plant Leaves
Authors Gladimir V. G. Baranoski and Denise Eng
Abstract The retrieval of plant biophysical and biochemical properties from high spectral resolution data represents an active area of research within the remote sensing field. Scientific studies in this area are usually supported by computational simulations of light attenuation processes within foliar tissues. In heterogeneous organic materials, like plant leaves, sieve and detour effects can affect these processes, and ultimately change the light gradients within these tissues and their spectral signatures. Although these effects have been extensively examined for applications involving the interactions of visible radiation with plant leaves, little is known about their role in the infrared domain. In this paper, we describe the procedural basis for their incorporation in the modelling of infrared radiation transport (in the range of 750 to 2500nm) within plant leaves. We also assess their impact on the predictability of simulation solutions relating the directionality of the incident radiation and the internal arrangement of the tissues to changes on foliar spectral signatures in this domain. Our investigation is grounded by observations involving modeled results and quantitative and qualitative data reported in the literature
Date November 2, 2006
Report An Investigation on Sieve and Detour Effects Affecting the Interaction of Infrared Radiation with Plant Leaves (PDF)
Title Optimal lower bounds for rank and select indexes
Authors Alexander Golynski
Abstract We develop a new lower bound technique for data structures. We show an optimal $\Omega(n \lg\lg n / \lg n)$ space lower bounds for storing an index that allows to implement rank and select queries on a bit vector $B$ provided that $B$ is stored explicitly. These results improve upon [Miltersen, SODA'05]. We show $\Omega((m/t) \lg t)$ lower bounds for storing rank/select index in the case where $B$ has $m$ $1$-bits in it (e.g. low $0$-th entropy) and the algorithm is allowed to probe $t$ bits of $B$. We simplify select index given in [Raman et al., SODA'02] and show how to implement both rank and select queries with an index of size $(1 + o(1)) (n \lg\lg n / \lg n) + O(n / \lg n)$ (i.e. we give an explicit constant for storage) in the RAM model with word size $\lg n$.
Date February 10, 2006
Report Optimal lower bounds for rank and select indexes (PDF)
Title Comparisions on Different Approachs to Assign Missing Attribute Values
Authors Jiye Li and Nick Cercone
Abstract A commonly-used and naive solution to process data with missing attribute values is to ignore the instances which contain missing attribute values. This method may neglect important information within the data, significant amount of data could be easily discarded, and the discovered knowledge may not contain significant rules. Some methods, such as assigning the most common values or assigning an average value to the missing attribute, may make good use of all the available data. However the assigned value may not come from the information which the data originally derived, thus noise is brought to the data. We introduce a new approach RSFit on processing data with missing attribute values based on rough sets theory. By matching attribute-value pairs among the same core or reduct of the original data set, the assigned value preserves the characteristics of the original data set. We compare our approach with "closest fit approach globally" and "closest fit approach in the same concept". Experimental results on UCI data sets and a real geriatric care data set show our approach achieves comparable accuracy on assigning the missing values while significantly reduces the computation time.
Date January 2006
Report Comparisions on Different Approachs to Assign Missing Attribute Values (PDF)
Title A Scalable Peer-to-peer Protocol Enabling Efficient and Flexible Search
Authors Reaz Ahmed and Raouf Boutaba
Abstract Efficient discovery of information, based on partial knowledge, is a challenging problem faced by many large scale distributed systems. This paper presents a peer-to-peer search protocol that addresses this problem. The proposed system provides an efficient mechanism for advertising a binary pattern, and discovering it using any subset of its 1-bits. A pattern (e.g., Bloom filter) summarizes the properties (e.g., keywords or service description) associated with a shared object (e.g., document or service).

The proposed system has a partially decentralized architecture involving superpeers and adopts a novel structured search mechanism derived from the theory of Error Correcting Codes (ECC). Better resilience to peer failure is achieved by utilizing replication and redundant routing paths. The number of routing hops and the number of links maintained by each superpeer scales logarithmically with the number of superpeers. The concept presented in this paper is supported with theoretical analysis, and simulation results.

Date January 2006
Report A Scalable Peer-to-peer Protocol Enabling Efficient and Flexible Search (PDF)
Title The R-Acyclic Semiunification Problem
Authors Brad Lushman, Gordon V. Cormack
Abstract We recast Kfoury and Wells' formulation of the acyclic semiunification problem (ASUP) in graph- theoretic terms and prove equivalence between the two formulations. We then relax and simplify the graph-theoretic formulation; we call the resulting problem the R-acyclic semiunification problem (R-ASUP), which we show to be a strict superset of ASUP. We prove that the ASUP solution procedure terminates and produces most general solutions for R-ASUP (and hence for ASUP) in the same sense as Robinson's unification algorithm. We thus extend the class of semiunification instances known to be decidable.
Date March 13, 2006
Report The R-Acyclic Semiunification Problem (PDF)
Title FIX: Feature-based Indexing Technique for XML Documents
Authors Ning Zhang, M. Tamer Ozsu, Ihab F. Ilyas, and Ashraf Aboulnaga
Abstract In this paper, we study the problem of indexing an XML database. Existing XML indexing techniques focus on clustering methods based on the combinatorial structural properties of an XML document. These techniques cluster tree nodes into an index tree or graph based on their similarities in ancestor-descendant or sibling relationships. Index look-up then amounts to pattern matching on the clustered tree or graph. In this paper, we propose a feature-based indexing technique, called FIX, based on the spectral graph theory. The basic idea is that for each twig pattern in a collection of XML documents, we calculate a vector of features based on its structural properties. These features are used as a key for the patterns and stored in a B-tree or a multidimensional index tree. Given an XPath query, its feature vector is first calculated and looked up in the index. Then a further refinement phase is performed to fetch the final results. We experimentally study the indexing technique over two scenarios: a large collection of relatively smaller documents, and a single large document. Our experiments show that FIX provides great pruning power and could gain an order of magnitude performance improvement for many XPath queries over existing evaluation techniques.
Date March 14, 2006
Report FIX: Feature-based Indexing Technique for XML Documents (PDF)
Title A <I>More</I> Direct Algorithm for Type Inference in the Rank-2 Fragment
of the Second-Order Lambda-Calculus
Authors Brad Lushman, Gordon V. Cormack

We present an algorithm for rank-2 type inference in the second-order $\lambda$-calculus. Our algorithm differs from the well-known algorithm of Kfoury and Wells in that it employs only a quadratically fewer type variables and inequalities. Our algorithm consists of a translation from a $\lambda$-term to an instance of $R$-ASUP (a decidable superset of ASUP) in which the variables correspond more directly to features in the original term. We claim that our construction, being simpler and more direct, is more amenable to proof and extension.

Date March 29, 2006/Revised July 27, 2007
Report A <I>More</I> Direct Algorithm for Type Inference in the Rank-2 Fragment (PDF)
Title Robust Numerical Valuation of European and American Options under the CGMY Process
Authors Iris R. Wang, Justin W.L. Wan and Peter A. Forsyth.
Abstract We develop an implicit discretization method for pricing European and American options when the underlying asset is driven by an infinite activity Levy process. For processes of finite variation, quadratic convergence is obtained as the mesh and time step are refined. For infinite variation processes, better than first order accuracy is achieved. The jump component in the neighbourhood of log jump size zero is specially treated by using a Taylor expansion approximation and the advection term is dealt with using a semi-Lagrangian scheme. The resulting Partial Integro-Differential Equation (PIDE) is then solved using preconditioned BiCGSTAB method coupled with a fast Fourier transform. Proofs of fully implicit timestepping stability and monotonicity are provided. The convergence properties of the BiCGSTAB scheme are discussed and compared with a fixed point iteration. Numerical tests on convergence and performance of European and American options under processes of finite and infinite variation are presented.
Date April 7, 2006
Report Robust Numerical Valuation of European and American Options under the CGMY Process (PDF)
Title Succinct Encodings for XPath Location Steps
Authors JC)rC)my Barbay and S. Srinivasa Rao
Abstract We consider in this paper the problem of encoding XML documents in small space while still supporting XPath Location steps efficiently. We model XML documents as multi-labeled trees, and propose for those an encoding which takes space close to the lower bound suggested by information theory, while still supporting the search for the ancestors, descendants and children matching a given label efficiently. To achieve this goal, we extend the previous results from Golynski et al. on strings over large alphabets, and from Barbay et al. on binary relations, to support the rank and select operators in several orders at once; and we extend the results from Geary et al. on ordinal trees to support the DFUDS order, in addition to the other operators.
Report Succinct Encodings for XPath Location Steps (PDF)
Title Adaptive Search Algorithm for Patterns, in Succinctly Encoded XML
Authors JC)rC)my Barbay
Abstract We propose an adaptive algorithm for context queries (queries expressed as preorder and ancestor-descendant relations on labeled nodes), which can be used to find patterns in XML documents. Our algorithm takes advantage of the correlation between terms of the query without any preprocessed information, and it runs in time (kd(lg lg min(n,s)+lg lg(r))) in the RAM model, where k is the number of terms in the query, d is the non-deterministic complexity of the query on the multi-labeled tree (i.e. the minimum number of operations required to check the answer to the query), n is the number of nodes in the tree, s is the number of relations between nodes and labels, and r is the maximal number of nodes matching a label on any rooted path in the tree.
Date April 21, 2006
Report Adaptive Search Algorithm for Patterns, in Succinctly Encoded XML (PDF)
Title Age and Geographic Inferences of the LiveJournal Social Network
Authors Ian McKinnon, Rob Warren
Abstract Online social networks are often a by-product of blogging and other online media sites on the Internet. Services such as LiveJournal allow their users to specify who their "friends" are, and thus a social network is formed. This paper will explore the relationship between users with the intent of being able to make a prediction of a users age and country of residence based on the information given by their friends on this social network.
Date June, 2006
Report Age and Geographic Inferences of the LiveJournal Social Network (PDF)
Title Predicting Missing Attribute Values based on Frequent Itemset and RSFit
Authors Jiye Li and Nick Cercone
Abstract How to process missing attribute values is an important data preprocessing problem in data mining and knowledge discovery tasks. A commonly-used and naive solution to process data with missing attribute values is to ignore the instances which contain missing attribute values. This method may neglect important information within the data and a significant amount of data could be easily discarded. Some methods, such as assigning the most common values or assigning an average value to the missing attribute, make good use of all the available data. However the assigned value may not come from the information which the data originally derived from, thus noise is brought to the data. We introduce an integrated approach ItemRSFit to effectively predict missing attribute values by combining frequent itemset and RSFit approaches together. Frequent itemset is generated from the association rules algorithm and it displays the correlations between different items in a transaction data set. Using frequent itemset as a knowledge base to predict missing attribute values is shown to have a high prediction accuracy. However this approach alone cannot predict all the existing missing attributes. RSFit is a newly developed approach to predict missing attribute values based on the similarities of attribute-value pairs by only considering attributes contained in the core or the reduct of the data set. The RSFit approach provides a faster prediction and can be used for the cases that are not covered by the itemset approach. Empirical studies on UCI data sets and a real world data set demonstrate a significant increase of predicting accuracy obtained from this new integrated approach.
Date April 25, 2006
Report Predicting Missing Attribute Values based on Frequent Itemset and RSFit (PDF)
Title Seeking Empirical Evidence for Self-Organized Criticality in Open Source Software Evolution
Authors Jingwei Wu, Richard C. Holt
Abstract We examine 11 open source software systems and present empirical evidence for the existence of fractal structures in software evolution. In our study, fractal structures are measured as power laws through the lifetime of a software system. We describe two specific power law related phenomena: the probability distribution of software changes decreases as a power function of change sizes; and the time series of software change exhibits long range correlations with power law behavior. The existence of such spatial (across the system) and temporal (over the system lifetime) power laws suggests that Self-Organized Criticality (SOC) occurs in the evolution of open source systems. As a consequence, SOC may be useful as a conceptual framework for understanding software evolution dynamics (the cause and mechanism of change or growth). We give a qualitative explanation of software evolution based on SOC. We also discuss some implications of SOC to current software practices.
Date April 26, 2006
Report Seeking Empirical Evidence for Self-Organized Criticality in Open Source Software Evolution (PDF)
Title Collaborative and Coordinated Product Configuration
Authors Marcilio Mendonca, Toacy Oliveira, Donald Cowan
Abstract Product configuration is a key activity of product engineering that regards the constrained combination and parameterization of product line assets as a means to achieve correct software specification. Current product configuration approaches frequently rely on the application engineer to translate user requirements into correct configuration choices. This process is error-prone and risky as requirements may lead to conflicting decisions at configuration time. Indeed, we deem that an important aspect of product configuration has long been neglected: its collaborative nature. In our research, we advocate that product configuration is enhanced by a collaborative perspective, providing that conflicting scenarios are properly handled. We propose an approach to support collaborative and co-ordinated product configuration by promoting processes to first-order elements for the explicit guidance of configuration decisions. We provide insights on important coordination issues and introduce an algorithm to derive process models from annotated feature models to illustrate the approach's feasibility. Keywords: Product Configuration, Software Processes, Software Product Lines, Collaborative Software Configuration.
Date May 17, 2006
Report Collaborative and Coordinated Product Configuration (PDF)
Title Geometrical Mesh Improvement Properties of Delaunay Terminal Edge Refinement
Authors Bruce Simpson, David Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1
and Maria-Cecilia Rivars, Department of Computer Science, Universidad de Chile, Blanco Encalada 2120, Santiago, Chile
Abstract "The use of edge based refinement in general, and Delaunay terminal edge refinement in particular are well established for planar meshing, but largely on a heuristic basis. In this paper, we present a series of theoretical results on the geometric mesh improvement properties of these methods. The discussion is based on refining a mesh to meet a specified angle tolerance."
Date July 25, 2006
Report Geometrical Mesh Improvement Properties of Delaunay Terminal Edge Refinement (PDF)
Title A Program Extractor Suite for C and C++: Choosing the Right Tool for the Job
Authors Jingwei Wu, Richard C. Holt
Abstract This report describes a suite of program extractors, which we developed by adopting and extending existing program parsing and extraction techniques or tools. This suite is called CX because it is mainly targeted at extracting facts from C and C++ programs. This suite is currently composed of four extractors: CPPX, BFX, LDX and CTSX. The main goal of creating CX is to provide a convenient set of program extractors, which complement each other and work in a systematic manner. The benefits of this extractor suite will be discussed in terms of two practical applications: (1) creating program comprehension pipelines to support various understanding tasks, and (2) building an open source software evolution database (EvolDB) to support empirical research on software evolution.
Date June 3, 2006
Report A Program Extractor Suite for C and C++: Choosing the Right Tool for the Job (PDF)
Title A Survey of Data Management in Peer-to-Peer Systems
Authors Rolando Blanco, Nabeel Ahmed, David Hadaller, L. G. Alex Sung, Herman Li, and Mohamed Ali Soliman
Abstract Peer-to-Peer (P2P) systems provide a decentralized infrastructure for resource sharing. In particular, file sharing was the initial motivation behind many of the first successful P2P systems. As P2P systems evolve to support sharing of structured and semantically rich data, several data management issues must be addressed. Specifically, P2P systems need to deal with data location, data integration, data querying, and data consistency issues. Data management is further complicated in P2P systems due to the volatility and scale of these systems. In this survey we propose a reference architecture for P2P systems that focuses on the data management activities required to support sharing of structured data. Based on this reference architecture, we present and classify technologies that have been proposed to solve the data management issues particular to P2P systems.
Date June 20 , 2006
Report A Survey of Data Management in Peer-to-Peer Systems (PDF)
Title Incremental Maintenance of Global Aggregates
Authors Nabeel Ahmed, David Hadaller, and Srinivasan Keshav
Abstract Providing local access to global information can improve the efficiency of many distributed applications. Examples include applications that aggregate sensor values, search in Peer-to-Peer systems, or perform Top-K queries in stream-oriented databases. Efficient computation of such aggregates is difficult due to the massive scale and dynamics of such systems and has led to the proposal of several approximate techniques based on randomized gossip algorithms. However, maintenance of such aggregates has not been adequately addressed in the literature. Changes in node state, therefore, require a full and expensive re-computation of the global aggregate. This paper makes three contributions to this field. First, we propose a variant of the well-known FM aggregation scheme that allows us to support incremental maintenance of aggregates. Second, we propose the concept of significance thresholds and illustrate their benefits. Finally, we present a detailed performance evaluation of our techniques and find that we can reduce computation time by 60% compared to recomputing the aggregate, as is traditionally done.
Date June 21, 2006
Report Incremental Maintenance of Global Aggregates (PDF)
Title "On Pattern Expression Languages"
Authors Cezar Campeanu and Nicolae Santean
Abstract In this paper we show that the family of pattern expression languages is closed under the intersection with regular languages. Since this family is not closed under complement but is closed under reverse, a natural question arises, that is, whether particular languages such as those containing words of type $ww^R$ are pattern expression languages or not. We give a proof for a negative answer to this question, and we provide several examples of languages which can not be specified by pattern expressions.

Keywords: pattern, pattern automata, pattern expressions, regex, regular expressions

Date December 11, 2006
Report "On Pattern Expression Languages" (PDF)
Title A Robust Overlay for Distributed Range Query Processing
Authors AndrC) Allavena Qiang Wang Ihab Ilyas Srinivasan Keshav
University of Waterloo
{aallavena, q6wang, ilyas, keshav}
Abstract Large-scale data-centric services are often handled by clusters of computers that include hundreds of thousands of computing nodes. However, traditional distributed query processing techniques fail to handle the large-scale distribution, peer-to-peer communication and frequent disconnection. In this paper, we introduce LOT, a robust, fault-tolerant and highly distributed overlay network for large-scale peer-to-peer query processing. LOT is based on a robust tree overlay for distributed systems. It uses virtualization, replication, geographic-based clustering and flexible state definition as basic design principles. We show how we map these principles to desirable performance goals. Moreover, we provide a light-weight maintenance mechanism for updating state information. Analysis and simulations show that our approach is superior to other well-known alternatives in its query processing performance and handling of churn.
Date July 14, 2006
Report A Robust Overlay for Distributed Range Query Processing (PDF)
Title Fast, Efficient and Robust In-Network Computation
Authors AndrC) Allavena and Srinivasan Keshav
University of Waterloo
Abstract Today, companies such as eBay, Amazon, Google, and IBM routinely operate clusters with more than 10,000 servers located in data centres around the world. Developing applications that efficiently use the resources of such a distributed cluster is challenging. This task is made eased by a middleware layer that provides application programmers with the illusion of dynamically updated "global state". Maintaining global state can be viewed as repeatedly computing an arbitrary function over a set of time-varying values, where the values are held at each node, and every node needs to know the resultant function. Many well-known problems in distributed systems, such as load balancing, leader election, barrier synchronisation, distributed file system maintenance, and co-ordinated intrusion detection can be cast in this form.

We present and rigorously analyse an algorithm that uses a a tree of virtual nodes to compute nearly arbitrary functions of global state. Our scheme is fast: running time is $\Theta(ln n)$. It is efficient: nodes exchange O(n ln n) messages. Most of these messages are within a data centre and therefore are relatively cheap. It is accurate: the computed value does not have inherent errors either due to double-counting, as with standard gossip, or due to stochastic counting, as in the Flajolet-Martin approach. Finally, it is fault tolerant: The algorithm fails with probability O( 1 / n^{c(1+r)} )$ where c is a constant and 0 <= r \leq (\ln n)^Cst a user-chosen reliability parameter. We therefore believe that our work is the basis for building robust and efficient distributed systems in a variety of problem domains.

Date July 14, 2006
Report Fast, Efficient and Robust In-Network Computation (PDF)
Title Weighted Queries on Binary Relations and Multi-Labeled Trees
Authors Jeremy Barbay, Aleh Veraskouski
Abstract The common way to search large indexed databases is through conjunctive queries on binary relations, such as those associating keywords to web page references. We extend this model by adding positive weights to the terms of the query. In this context we give and analyze two algorithms to answer such queries in a pertinent way. We extend this approach to solve weighted queries on tree-structured objects such as file-systems or XML documents.
Date July 27, 2006
Report Weighted Queries on Binary Relations and Multi-Labeled Trees (PDF)
Title Succinct Indexes for Strings, Binary Relations and Multi-labeled Trees
Authors Jeremy Barbay, Meng He, J. Ian Munro, and S. Srinivasa Rao
Abstract We define and design succinct indexes for several abstract data types (ADTs). The concept is to design auxiliary data structures called succinct indexes that occupy asymptotically less space than the information-theoretic lower bound on the space required to encode the given data, and support an extended set of operations using the basic operators defined in the ADT.

As opposed to succinct encodings, The main advantage of succinct indexes is that we make assumptions only on the ADT through which the main data is accessed, rather than the way in which the data is encoded. This allows more freedom in the encoding of the main data.In this paper, we present succinct indexes for various data types, namely strings, binary relations and multi-labeled trees. Given the support for the interface of the ADTs of these data types, we can support various useful operations efficiently by constructing succinct indexes for them. When the operators in the ADTs are supported in constant time, our results are comparable to previous results, while allowing more flexibility in the encoding of the given data. Using our techniques, we design the first succinct encoding that represents a string of length $n$ over alphabet $[\sigma]$ using $nH_k+o(n\lg \sigma)$ bits \footnote{We use $\lg n$ to denote $\lceil \log_2 n \rceil$.} that support access / rank / select operations in $o((\lg\lg \sigma)3)$ time. We also design the first succinct text index using $nH_k+o(n\lg\sigma)$ bits that supports pattern matching queries in $O(m\lg\lg\sigma + occ\lg^{1+\epsilon}n\lg\lg\sigma)$ time, for a given pattern of length $m$. Previous results on these two problems either have a $\lg\sigma$ factor instead of $\lg\lg\sigma$ in terms of running time, or are not compressible, but our results do not have such problems.

Date July 27, 2006
Report Succinct Indexes for Strings, Binary Relations and Multi-labeled Trees (PDF)
Title Top-k Query Processing in Uncertain Databases
Authors Mohamed A. Soliman, Ihab F. Ilyas, Kevin Chen-Chuan Chang

Top-k processing in uncertain databases is semantically and computationally different from traditional top-k processing. The interplay between score and uncertainty information makes traditional processing techniques inapplicable to uncertain databases. In this paper we introduce new probabilistic formulations for top-k queries. Our formulations are based on marriage of traditional top-k semantics with possible worlds semantics. In the light of these formulations, we construct a framework that encapsulates a novel probabilistic model and efficient query processing techniques to tackle the challenges raised by uncertain data settings. We prove that our techniques are optimal in terms of the number of accessed tuples. Our experiments show the efficiency of our techniques under different data distributions with orders of magnitude improvement over naive materialization of possible worlds.

Date August 2, 2006 /Updated August 14, 2007
Report Top-k Query Processing in Uncertain Databases (PDF)
Title Multi-Query Optimization of Sliding Window Aggregates by Schedule Synchronization
Authors Lukasz Golab, Kumar Gaurav Bijay, and M. Tamer Ozsu
Abstract Data stream systems process persistent queries, typically posed over sliding windows and re-evaluated periodically as the windows slide forward. Due to their long-running nature, a number of similar persistent queries may run in parallel at any given time, therefore multi-query optimization is particularly important. In traditional multi-query optimization, one of the primary goals is to detect common parts across multiple queries issued at the same time and perform the common task only once. Another related goal is to check if a new query may be answered using the answer of a related query which has been computed previously (and is stored as a materialized view). In this paper, we argue that in the context of periodically re-evaluated queries, multi-query optimization requires an additional step beyond common sub-expression matching. This is because queries that have been identified as similar may be re-evaluated with different frequencies and therefore may be scheduled at different times. Thus, the additional step must attempt to synchronize the re-execution times of similar queries in order to take advantage of computation sharing.

The solution presented in this paper focuses on periodically-evaluated aggregates over sliding windows of various lengths, which are a common class of persistent queries used for monitoring purposes. The proposed solution assumes that users specify an upper bound on the interval between re-evaluations of their queries and is based upon the following insight: it may be cheaper to re-execute some queries more often if their re-execution schedules can be synchronized with those of similar queries, thereby amortizing the computation costs. We also show that additional schedule synchronization is possible when the system is forced to lengthen the desired re-execution intervals during periods of overload. Our solutions are based upon a variant of the earliest-deadline-first algorithm that views persistent queries are periodic tasks. Experimental results show significant improvements in system throughput due to increased resource sharing.

Date August, 2006
Report Multi-Query Optimization of Sliding Window Aggregates by Schedule Synchronization (PDF)
Title Sliding Window Query Processing over Data Streams
Authors Lukasz Golab
Abstract Database management systems (DBMSs) have been used successfully in traditional business applications that require persistent data storage and an efficient querying mechanism. Typically, it is assumed that the data are static, unless explicitly modified or deleted by a user or application. Database queries are executed when issued and their answers reflect the current state of the data. However, emerging applications, such as sensor networks, real-time Internet traffic analysis, and on-line financial trading, require support for processing of unbounded data streams. The fundamental assumption of a data stream management system (DSMS) is that new data are generated continually, making it infeasible to store a stream in its entirety. At best, a sliding window of recently arrived data may be maintained, meaning that old data must be removed as time goes on. Furthermore, as the contents of the sliding windows evolve over time, it makes sense for users to ask a query once and receive updated answers over time.

This dissertation begins with the observation that the two fundamental requirements of a DSMS are dealing with transient (time-evolving) rather than static data and answering persistent rather than transient queries. One implication of the first requirement is that data maintenance costs have a significant effect on the performance of a DSMS. Additionally, traditional query processing algorithms must be re-engineered for the sliding window model because queries may need to re-process expired data and "undo" previously generated results. The second requirement suggests that a DSMS may execute a large number of persistent queries at the same time, therefore there exist opportunities for resource sharing among similar queries.

The purpose of this dissertation is to develop solutions for efficient query processing over sliding windows by focusing on these two fundamental properties. In terms of the transient nature of streaming data, this dissertation is based upon the following insight. Although the data keep changing over time as the windows slide forward, the changes are not random; on the contrary, the inputs and outputs of a DSMS exhibit patterns in the way the data are inserted and deleted. It will be shown that the knowledge of these patterns leads to an understanding of the semantics of persistent queries, lower window maintenance costs, as well as novel query processing, query optimization, and concurrency control strategies. In the context of the persistent nature of DSMS queries, the insight behind the proposed solution is that various queries may need to be refreshed at different times, therefore synchronizing the refresh schedules of similar queries creates more opportunities for resource sharing.

Date August, 2006
Report Sliding Window Query Processing over Data Streams (PDF)
Title "Deterministic Simulation of a NFA with k-symbol Lookahead"
Authors Bala Ravikumar and Nicolae Santean
Abstract We investigate deterministically simulating (i.e., solving the membership problem for) nondeterministic finite automata (NFA), relying solely on the NFA's resources (states and transitions). Unlike the standard NFA simulation, involving an algorithm which stores at each step all the states reached nondeterministically while reading the input, we consider deterministic finite automata (DFA) with lookahead, which choose the "right" NFA transitions based on a fixed number of input symbols read ahead. This concept, known as lookahead delegation, arose in a formal study of web services composition and its subsequent practical applications. Here we answer several related questions, such as "when is lookahead delegation possible?" and "how hard is it to find a delegator with a given lookahead buffer size?". In particular, we show that only finite languages have the property that all of their NFA's have delegators. This implies, among others, that delegation is a machine property, rather than a language property. We also prove that the existence of lookahead delegators for unambiguous NFA is decidable, thus partially solving an open problem. Finally, we show that finding delegators (even for a given buffer size) is hard in general, and is efficient for unambiguous NFA, and we give an algorithm and a compact characterization for NFA delegation in general.
Date August, 2006
Report "Deterministic Simulation of a NFA with k-symbol Lookahead" (PDF)
Title Query Processing and Optimization in Native XML Databases
Author Ning Zhang

XML has emerged as a semantic markup language for documents as well as the de facto language for data exchange over the World Wide Web. Declarative query languages, such as XPath and XQuery, are proposed for querying over large volumes of XML data. A number of techniques have been proposed to evaluate XML queries more efficiently. Many of these techniques assume a tree model of XML documents and are, therefore, also applicable to other data sources that can be explicitly or implicitly translated into a similar data model.

The focus of this thesis is on efficient evaluation and optimization of path expressions in native XML databases. Specifically, the following issues are considered: storage system design, design of physical operators and efficient execution algorithms, and the cost-based query optimizer.

The proposed storage system linearizes the tree structure into strings that can be decomposed into disk pages. Simple statistics are kept in the page headers to facilitate I/O-efficient navigation. Based on this storage system, a hybrid approach is developed to evaluate path expressions that exploit the advantages of navigational and join-based approaches. Path expressions that only contain "local" axes (child, parent, attribute, self, following-sibling, and preceding-sibling) are evaluated by means of the proposed "Next-of-Kin" (NoK) operator. A general path expression that contains both local axes and global ones (ancestor, descendant, ancestor-or-self, descendant-or-self, following, and preceding) is decomposed into NoK subtrees whose intermediate results are structurally joined to produce the final result. Experiments show that the navigational operator can be an order of magnitude faster than join-based approaches in some cases, but slower in others. Thus a cost-based query optimizer is necessary to choose the optimal execution plan based on estimates of the cost of each operator.

The cost of an operator heavily depends on the cost model and its input. The inputs to the cost model are usually the cardinalities of path expressions. In this thesis, a synopsis structure called XSEED is proposed to estimate the cardinality of a path expression. An XSEED synopsis can be constructed by compressing an XML document to a small kernel first, and then more information can be added to the synopsis. XSEED results in more accurate cardinality estimation than previous approaches and is easier to construct, easier to update, and can incorporate query feedback.

Efficient query execution exploits indexes. The last component of the thesis is a feature-based structural index, called FIX, to expedite tree matching on a large tree. This is based on the observation that the navigational operation is expensive and applying it to every tree node is very inefficient. FIX extracts distinctive features from subtrees and uses them as the index keys. Similarly, for each incoming query, the features of the query tree are extracted and used as search keys to retrieve the candidate results. Thus, it is sufficient to match only on these candidate results. The experimental results show that the pruning power of FIX is very high -- more than 99% for structure-rich data sets, and more than 20% for data sets with less structural variety.

Date August, 2006
Report Query Processing and Optimization in Native XML Databases (PDF)
Title "On the definition of stochastic lambda-transducers"
Authors Stavros Konstantinidis and Nicolae Santean
Abstract We propose a formal definition for the general notion of stochastic transducer, called stochastic lambda-transducer. Our definition is designed with two objectives in mind: (i) to extend naturally the established notion of stochastic automaton with output - as defined in the classic books of Paz (1971) and Starke (1972) - by permitting pairs of input-output words of different lengths; (ii) to be compatible with the more general notion of weighted transducer so that one can apply tools of weighted transducers to address certain computational problems involving stochastic transducers. The new transducers can be used to model stochastic input-output processes that cannot be modeled using classic stochastic automata with output.

Keywords: Probabilistic transducer, Probabilistic automaton, Stochastic transducer, Stochastic automaton, Stochastic transduction, Weighted transducer, Transducer, Automaton

Date September 5, 2006
Report "On the definition of stochastic lambda-transducers" (PDF)
Title Generalized Labeled LCA Queries
Authors Jeremy Barbay, Ehsan Chiniforooshan, Alexander Golynski, Jui-Yi Kao, Aleh Veraskouski
Abstract Schema-free queries permit to search XML documents without knowing their schema. Among them, lowest common ancestor (LCA) queries were introduced in several variants on labeled trees. We define threshold LCA queries to generalize all those variants, and to extend them to the case where weights are assigned to each term of the query. We study how to solve those in the context where access to the document is streamed, and in the context where the document is accessed through a precomputed index. We propose space-efficient algorithms for both contexts, using space O(h+s) independant from the size of the document, where h is the height of the input document and s is the number of different labels. We describe two distinct algorithms in the streamed model, both of which read the input stream exactly once and run in linear time, and one algorithm in the indexed model, which provably runs in sublinear time for many instances, characterized by a difficulty measure.
Date Revised July 2007
Report Generalized Labeled LCA Queries (PS) Generalized Labeled LCA Queries (PDF)
Title Graph Isomorphism Completeness for Perfect Graphs and Subclasses of Perfect Graphs
Authors C. Boucher and D. Loker
Abstract A problem is said to be GI-complete if it is provably as hard as graph isomorphism; that is, there is a polynomial-time Turing reduction from the graph isomorphism problem. It is known that the GI problem is GI-complete for some special graph classes including regular graphs, bipartite graphs, chordal graphs and split graphs. In this paper, we prove that deciding isomorphism of double split graphs, the class of graphs exhibiting a 2-join, and the class of graphs exhibiting a balanced skew partition are GI-complete. Further, we show that the GI problem for the larger class including these graph classes--that is, the class of perfect graphs--is also GI-complete.
Date September 6, 2006
Report Graph Isomorphism Completeness for Perfect Graphs and Subclasses of Perfect Graphs (PDF)
Title Expected Approximation guarantees for the Demand Matching Problem
Author C. Boucher, D. Loker
Abstract The objective of the demand matching problem is to obtain the subset $M$ of edges which is feasible and where the sum of the profits of each of the edges is maximized. The set $M$ is feasible if for each vertex $v$ the total demand of edges in $M$ incident to $v$ is at most $b_v$. In the case where each of the edges has one unit profit, the problem becomes finding the subset of largest size and hence, is called the cardinality problem. Shepherd and Vetta [SV06] demonstrate that the integrality gap for the general demand matching problem for bipartite graphs is between $2.5$ and $2.764$, and between $3$ and $3.264$ non-bipartite graphs. We demonstrate that an expected $2.5$-approximation guarantee and $3$-approximation guarantee is achieveable for bipartite graphs and non-bipartite graphs and give some connections to the independent set and weighted independent set problem.
Date September 18, 2006
Report Expected Approximation guarantees for the Demand Matching Problem (PDF)
Title Metro: An Analysis Toolkit for Template Semantics
Authors Joanne M. Atlee, Nancy A. Day,
Jianwei Niu, Eunsuk Kang, Yun Lu, David Fung, Leonard Wong
Abstract We describe the Metro toolkit, which supports software modelling and analysis for requirements notations that have configurable semantics. Metro is based on a formalism, called template semantics, which structures the operational semantics of a family of notations as a predefined parameterized template that is instantiated with user-provided parameter values. Thus, the semantics of a single notation can be expressed succinctly as a set of parameter values to this template. The Metro toolkit takes as input a specification and a set of template-parameter values, and it produces an analyzable model. The toolkit can either translate the specification into the input language of an existing model checker (e.g., SMV), or compile the specification into a more primitive form (e.g., logic, BDDs) that is suitable for analysis. MagicDraw is used as a front-end for editing specifications and animating SMV-generated counterexamples.
Date September 22 , 2006
Report Metro: An Analysis Toolkit for Template Semantics (PDF)
Title Formal Verification of the A-7E Software Requirements using Template Semantics
Authors Eunsuk Kang and Nancy A. Day
Abstract Template semantics is a template-based approach to ease the process of identifying the essential differences among model-based notations. In this approach, a template captures semantics that are common among notations and allows users to specify only the distinctive features of a notation. In this paper, we illustrate the method of describing requirements in Software Cost Reduction (SCR) using the Metro toolkit, which is the framework for the modelling and analysis of notations in template semantics. Furthermore, we demonstrate the usage of Metro to verify the A-7E software requirements and compare our verification effort to an alternative method of requirements analysis, which does not use template semantics.
Date September 22 , 2006
Report Formal Verification of the A-7E Software Requirements using Template Semantics (PDF)
Title Improved Bounds for the Online Steiner Tree Problem in Graphs of Bounded Edge-Asymmetry
Author Spyros Angelopoulos
Abstract In this paper we consider the Online Steiner Tree problem in weighted directed graphs of bounded edge-asymmetry $\alpha$. The edge-asymmetry of a directed graph is defined as the maximum ratio of the cost (weight) of antiparallel edges in the graph. The problem has applications in multicast routing over a network with non-symmetric links. We improve the previously known upper and lower bounds on the competitive ratio of any deterministic algorithm due to Faloutsos et al. In particular, we show that a better analysis of a simple greedy algorithm yields a competitive ratio of $O(\min \{k, \frac{\alpha \log k} {\log \log \alpha}\})$, where $k$ denotes the number of terminals requested. On the negative side, we show a lower bound of $\Omega(\min \{k^{1-\epsilon}, \frac{\alpha \log k}{\log \log k} \})$ on the competitive ratio of every deterministic algorithm for the problem, for any arbitrarily small constant $\epsilon$.
Date October 2, 2006
Report Improved Bounds for the Online Steiner Tree Problem in Graphs of Bounded Edge-Asymmetry (PDF)
Title Optimal Superblock Instruction Scheduling for Multiple-Issue Processors using Constraint Programming
Authors Abid M. Malik, Tyrel Russell, Michael Chase, and Peter van Beek
Abstract Modern processors have multiple pipelined functional units and can issue more than one instruction per clock cycle. This puts great pressure on the instruction scheduling phase in a compiler to expose maximum instruction level parallelism. Instruction level parallelism (ILP) at the local level using basic blocks is limited. Compilers increase ILP by doing instruction scheduling at the global level using larger regions, which are created by combining basic blocks. Superblocks are one of the most commonly used scheduling regions for global instruction scheduling. Superblock scheduling is NP-complete, and is done sub-optimally in production compilers using heuristic approaches. In this work, we present a constraint programming approach to the superblock instruction scheduling problem that is both optimal and fast enough to be incorporated into production compilers. We experimentally evaluated our optimal scheduler on the SPEC2000 integer and floating point benchmarks. On this benchmark suite, the optimal scheduler was very robust and scaled to the largest superblocks. Depending on the architectural model, between 99.991% to 99.999% of all superblocks were solved to optimality. The scheduler was able to routinely solve the largest superblocks, including superblocks with up to 2600 instructions. This compares favourably to the recent best work by Shobaki and Wilken on optimal superblock scheduling using dynamic programming and enumeration.
Date October 17, 2006
Report Optimal Superblock Instruction Scheduling for Multiple-Issue Processors using Constraint Programming (PDF)
Title Study of the Meiotic Recombination Hotspot Diffusion Paradox
Authors Jérémy Barbay
Abstract Zoologists and evolutionists ponder about an apparent paradox in the current model of how sexual reproduction happens, basing their conclusions on extensive simulations. We show through a mathematical analysis that the results of those simulations can be predicted for a larger class of models, and we deduce from this analysis one of the key features of the model which yield the paradox. Based on this analysis, we define another mathematical model which solves this paradox, and we check our results through simulation.
Date October 10, 2006
Report Study of the Meiotic Recombination Hotspot Diffusion Paradox (PS) Study of the Meiotic Recombination Hotspot Diffusion Paradox (PDF)
Title Caspian: A QoS-Aware Deployment Approach for Channel-based Component-based Applications
Authors Abbas Heydarnoori
Abstract With significant advances in software development technologies in recent years, it is now possible to have complex software applications that include a large number of heterogeneous software components distributed over a large network of computers with different computational capabilities. To run such applications, their components must be instantiated on proper hardware resources in their target environments so that requirements and constraints are met. This process is called software deployment. For large, distributed, component-based applications with many constraints and requirements, it is difficult to do the deployment process manually and automated tools and techniques are required. This report presents a graph-based approach for this purpose that is not dependent on any specific component technology and does the deployment planning with respect to the communication resources, i.e. channels, required by application components and communication resources available on the hosts in the target environment. In our approach, component-based applications and distributed environments are modeled with the help of graphs. Deployment of an application is then defined as the mapping of the application graph to the target environment graph.
Date November 6, 2006
Report Caspian: A QoS-Aware Deployment Approach for Channel-based Component-based Applications (PDF)
Title Unaligned Binary Codes for Index Compression in Schema-Independent Text Retrieval Systems
Authors Stefan Buettcher and Charles L. A. Clarke
Abstract We examine index compression techniques for schema-independent inverted files used in text retrieval systems. Schema-independent inverted files contain full positional information for all index terms and allow the structural unit of retrieval to be specified dynamically at query time, rather than statically during index construction. Schema-independent indices have different characteristics than document-oriented indices, and this difference can affect the effectiveness of index compression algorithms greatly.

Our experimental results show that unaligned binary codes that take into account the special properties of schema-independent indices achieve better compression rates than methods designed for compressing document indices and that they can reduce the size of the index by around 15% compared to byte-aligned index compression. Moreover, we present a number of performance-enhancing techniques that may be used to very efficiently decode unaligned codes. Thus, their more compact index representation does not carry the cost of a substantially decreased query processing performance. This contradicts most earlier results on the decoding performance of unaligned codes.

Date October 11, 2006
Report Unaligned Binary Codes for Index Compression in Schema-Independent Text Retrieval Systems (PDF)
Title BSim: A System for Three-Dimensional Visualization of Brownian Motion
Authors Tenn Francis Chen and Gladimir V. G. Baranoski
Abstract Brownian motion is considered to be of fundamental importance for theoretical and applied scientific research. To date, the computational renderings of this phenomenon available in the scientific literature have been limited to two-dimensional case studies. This paper describes an interactive computer graphics system for the three-dimensional visualization of this phenomenon. The visual simulations are based on the formulas provided in Einstein's seminal paper on this topic, which are carefully revisited to introduce the phenomenon's underlying physics. The system's predictive capability, which is a key attribute for educational and scientific applications, is demonstrated by the qualitative agreement between simulation results and actual Brownian motion observations.
Date November 22, 2006 (Revised April 29, 2009)
Report BSim: A System for Three-Dimensional Visualization of Brownian Motion (PDF)
Title Second-Tier Cache Management Using Write Hints
Authors Xuhui Li, Ashraf Aboulnaga, Kenneth Salem, Aamer Sachedina, Shaobo Gao
Abstract Storage servers, as well as storage clients, typically have large memories in which they cache data blocks. This creates a two-tier cache hierarchy in which the presence of a first-tier cache (at the storage client) makes it more difficult to manage the second-tier cache (at the storage server). Many techniques have been proposed for improving the management of second-tier caches, but none of these techniques use the information that is provided by {\em writes} of data blocks from the first tier to help manage the second-tier cache. In this paper, we illustrate how the information contained in writes from the first tier can be used to improve the performance of the second-tier cache. In particular, we argue that there are different reasons why storage clients write data blocks to storage servers (e.g., cleaning dirty blocks vs. limiting the time to recover from failure). These different types of writes can provide strong indications about the current state and future access patterns of a first-tier cache, which can help in managing the second-tier cache. We propose that storage clients inform the storage servers about the types of writes that they perform by passing {\em write hints}. These write hints can then be used by the server to manage the second-tier cache. We focus on the common and important case in which the storage client is a database system running a transactional (OLTP) workload. We describe, for this case, the different types of write hints that can be passed to the storage server, and we present several cache management policies that rely on these write hints. We demonstrate using trace driven simulations that these simple and inexpensive write hints can significantly improve the performance of the second-tier cache.
Date January 18, 2007
Report Second-Tier Cache Management Using Write Hints (PDF)
Title Semantic Prefetching of Correlated Query Sequences
Authors Ivan T. Bowman, Kenneth Salem
Abstract We present a system that optimizes sequences of related client requests by combining small requests into larger ones, thus reducing per-request overhead. The system predicts upcoming requests and their parameter values based on past observations, and prefetches results that are expected to be needed. We describe how the system makes its predictions and how it uses them to optimize the request stream. We also characterize the benefits with several experiments.
Date November 21, 2006
Report Semantic Prefetching of Correlated Query Sequences (PDF)
Title Correcting Sample Selection Bias by Unlabeled Data
Authors Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, Bernhard Sholkopf
Abstract We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by marching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
Date January 18, 2007
Report Correcting Sample Selection Bias by Unlabeled Data (PDF)
Title Skyline and Top-k Processing in Web Bargaining
Authors Mohamed A. Soliman, Ihab F. Ilyas, Nick Koudas
Abstract Skyline and top-k queries are gaining increasing importance in many emerging applications. Current skyline and top-k query processing techniques work on deterministic object attributes and known scores. However, in many practical scenarios these settings are inapplicable. In this paper we focus on web interaction scenarios where each interaction is a data object with a set of possible outcomes (scores). Obtaining exact interaction scores is expensive as it involves complex arbitration between interacting parties. Moreover, the outcome of each interaction might redefine other interactions scores. We demonstrate that the search space for solving such problems is very large. Based on this we formulate and present skyline and top-k processing algorithms that can efficiently reduce the search space. We present the results of a thorough experimental evaluation quantifying the relative performance of the algorithms we propose herein with respect to costly exact solutions. Our results indicate that our techniques can efficiently reduce the space and identify precise solutions.
Date November 23, 2006
Report Skyline and Top-k Processing in Web Bargaining (PDF)
Title List Update with Locality of Reference: MTF Outperforms All Other Algorithms
Authors Spyros Angelopoulos, Reza Dorrigiv, and Alejandro Lopez-Ortiz
Abstract It has been observed that in practice, typical request sequences for the list update problem exhibit a certain degree of locality of reference. We first extend the locality of reference model for the paging problem due to Albers et al~[STOC 2002/JCSS 2005] to the domain of list update; this addresses the open question of defining an appropriate model for capturing locality of reference in the context of list update [Hester and Hirschberg ACM Comp. Surv. 1985]. We then apply this model in conjunction with a recent technique for comparing online algorithms, namely bijective analysis~[SODA 2007] and analyze well known online algorithms for list update.

Using this framework, we prove that Move-to-Front (MTF) is the unique optimal algorithm for list update. This holds for both the standard cost function of Sleator and Tarjan~[C. ACM 1985] and the refined cost function proposed independently by Mart\'{\i}nez and Roura~[TCS 2000] and Munro~[ESA 2000]. Our work resolves an open conjecture of Mart\'{\i}nez and Roura, namely proposing a measure which can successfully separate MTF from all other algorithms.

Date November 22, 2006
Report List Update with Locality of Reference: MTF Outperforms All Other Algorithms (PDF)
Title MAXSM: A MultiHeuristic Approach to XML Schema Matching
Authors Mirza Beg, Laurent Charlin, Joel So
Abstract In this paper, we propose an automatic schema matching approach called MAXSM, designed specifically for matching schemas in the context of enterprise data integration. By focusing on enterprise data integration scenarios, it becomes viable to maintain a repository of established schema mappings and reuse that repository to enhance the accuracy of mapping candidates produced by MAXSM. As a consequence of focusing on the enterprise integration space, we further focus MAXSM on XML schema matching, specifically, we consider XML Schema Definition (XSD) schemas. MAXSM is a multi-heuristic schema matching approach which employs, amongst other heuristics, a novel heuristic using WordNet for determining natural-language semantic similarity between node labels. MAXSM defines and describes how transitive mappings can be discovered from known mappings and used to seed production of candidate mappings. We present a cogent tree spanning approach to traverse and search the two schemas more effectively. While this paper does not document an implementation of MAXSM, we do discuss a high-level implementation architecture as well as design of experiments to verify its efficacy.
Date December 11, 2006
Report MAXSM: A MultiHeuristic Approach to XML Schema Matching (PDF)
Title Succinct Data-Structures for Labeled Graphs
Authors Jérémy Barbay
Abstract Succinct data-structures support efficient search queries while
using space asymptotically optimal. Such data-structures are known for structures such as binary strings, ordinal and cardinal trees, graphs, binary relations, labeled and multi-labeled trees. We consider graphs where each node is associated to an arbitrary number of labels. We show that some categories of such graphs have a succinct encoding which supports efficiently label-based navigation and search operators.
Date December 11, 2006
Report Succinct Data-Structures for Labeled Graphs (PS) Succinct Data-Structures for Labeled Graphs (PDF)