# 2008 Technical Reports

 <2007 2009>
CS-2008-01
Title A Performance Evaluation of Database Systems on Virtual Machines
Authors Umar Farooq Minhas
Abstract

Virtual machine technologies offer simple and practical mechanisms to address many manageability problems in database systems. For example, these technologies allow for server consolidation, easier deployment, and more flexible provisioning. Therefore, database systems are increasingly being run on virtual machines. This offers many unique opportunities for database research. However, it is also important to understand the cost of virtualization. Virtual machine technologies add a layer of indirection between applications and the hardware that they use (e.g. CPU, memory, disk). This added complexity results in a performance overhead for software systems running in a virtual machine. In this thesis, we present an experimental study of the overhead of running a database workload in a virtual machine. Using a TPC-H workload running on PostgreSQL in a Xen virtual machine environment, we show that Xen does indeed introduce overhead for system calls, page fault handling, and disk I/O. However, these overheads do not translate to a high overhead in query execution time. We show that in all cases the average overhead is less than 10% and, therefore, conclude that the advantages of running a database system in a virtual machine do not come at a high cost in performance.

Date January 23, 2008
CS-2008-02
Title Worst Case Optimal Union-Intersection Expression Evaluation
Authors Ehsan Chiniforooshan, Arash Farzan, Mehdi Mirzazadeh
Abstract

We consider the problem of evaluating an expression con- sisting of unions and intersections of some sorted sets in the comparison model. Given the expression and the sizes of the sets, we are interested in the worst-case complexity of evaluating the expression in terms of the sizes of the sets. We assume the sets in the given expression are indepen- dent. We show a lower bound on this problem and present an algorithm that matches the lower bound asymptotically.

Date February 8, 2008
CS-2008-03
Title JTop Algorithms for Top-k Join Queries
Authors Reza Akbarinia, Ihab F. Ilyas, M. Tamer Özsu, Patrick Valduriez
Abstract

Top-k join queries are very important, because there are many applications in which users need to join multiple inputs and are interested in the top-k join results based on some scoring function that combines some attribute values of each input. One of the most efficient algorithms for top-k join queries is the Rank-Join algorithm. In this report, we first study this algorithm and show that there are many cases where its threshold is lazy, i.e. decreases very slowly, and the algorithm needs to go too far in the lists. Then, we propose a family of efficient algorithms for processing top-k join queries. The main idea is to take advantage of the specific information on join attribute values as well as characteristics of the query and of the underlying system. Our contributions are as follows. First, we propose a general model for the problem of top-k join queries which is useful for databases as well as many other areas of computing. Second, we propose JTop, an efficient top-k join algorithm for systems where random accesses are not expensive. JTop takes advantage of both random and sorted accesses as well as information on the join condition. In contrast to Rank-Join, JTop¹s threshold is not lazy if at least one of the scoring attributes has a reasonable progressive impact on the scoring function. Third, we propose two new algorithms, LR_JTop and NR_JTop, for systems where random accesses are expensive or not supported, respectively. Forth, we propose a new algorithm called BP_JTop which is designed for systems with ³position-based indexing², i.e. when accessing a data item, the index gives its position. For each of our algorithms, we prove that over any database, it stops before or at the same position at which Rank-Join stops. We also show that there is a class of databases over which our algorithm stops at a position that is O(n) times lower than that of Rank-Join, where n is the number of data items. We also conducted an extensive experimental study to evaluate the performance of our algorithms under different data distributions. The performance evaluation shows that our algorithms obtain high performance gains against the Rank-Join algorithm.

Date March 25, 2008
CS-2008-04
Title Extending Typestate Analysis to Multiple Interacting Objects
Authors Nomair A. Naeem, Ondrej Lhotak
Abstract

This paper extends static typestate analysis to temporal specifications of groups of interacting objects, which are expressed using tracematches. Unlike typestate, a tracematch state may change due to operations on any of a set of objects bound by the tracematch. The paper proposes a lattice-based operational semantics equivalent to the original tracematch semantics but better suited to static analysis. The paper defines a static analysis that computes precise local points-to sets and tracks the flow of individual objects, thereby enabling strong state updates of the tracematch state. The analysis has been proved sound with respect to the semantics. A context-sensitive version of the analysis has been implemented as instances of the IFDS and IDE algorithms. The analysis was evaluated on tracematches used in earlier work and found to be very precise. Remaining imprecisions could be eliminated with more precise modeling of references from the heap and of exceptional control flow.

Date March 25, 2008
CS-2008-05
Title State-Complexity Hierarchies of Uniform Languages of Alphabet-Size Length
Authors Janusz Brzozowski and Stavros Konstantinidis
Abstract

We study the state complexity of a special class of simple languages. If A is an alphabet of k letters, then a k-language is a nonempty set of words of length k, that is, a uniform language of length k. We show that every k-language of maximal state complexity is also a uniform language of length k of maximal state complexity. Moreover, we prove that, for every i between the minimal and the maximal state complexities, there is a language of complexity i. The proof is constructive: for each i we exhibit a language of complexity i. We introduce a family of "pi automata'' accepting languages whose words are permutations of the alphabet; the complexities of these languages form a complete hierarchy between k^2-k+3 and 2^k+1. We start with an automaton with k^2-k+3 states and show that states can be added one at a time, until the automaton has 2^k+1 states. We construct another family of automata, based on k-ary trees, whose languages define a complete hierarchy of complexities between 2^k+1 and the maximal complexity. Here, we start with an automaton with the maximal number of states and remove states one at a time, until an automaton with 2^k+1 states is reached.

Date March 11, 2008
CS-2008-06
Title Continuous Languages
Authors Thomas Ang and Janusz Brzozowski
Abstract

A language is prefix-continuous if it satisfies the condition that, if a word w and its prefix u are in the language, then so is every prefix of w that has  u as a prefix. Prefix-continuous languages include prefix-closed languages at one  end of the spectrum, and prefix-free languages, which include prefix codes, at the other. In a similar way, we define suffix-, bifix-, factor-, and subword-continuous languages  and their closed and free counterparts. We generalize these notions to arbitrary binary relations on the set of all words over a finite alphabet. This provides a common framework for diverse languages such as codes, factorial languages and ideals. We examine the relationships among these languages and  their closure properties.

Date March 12, 2008
CS-2008-07
Title Modeling Uncertainty in Duplicate Elimination
Authors George Beskales, Mohamed A. Soliman, Ihab F. Ilyas
Abstract

Real-world databases experience various data quality problems of different causes including heterogeneity of consolidated data sources, imprecision of reading devices, and data entry errors. Existence of duplicate records is a prominent data quality problem. The process of duplicate elimination often involves uncertainty in deciding on the true duplicates. Current tools resolve such uncertainty either through expert intervention, which is not always possible, or by taking destructive decisions that may lead to unrecoverable errors.

In this paper, we approach duplicate elimination from a new perspective treating deduplication procedures as data processing tasks with uncertain outcomes. We propose a complete uncertainty model that compactly encodes the space of clean instances of the input data, and introduce efficient model implementations. We extend our model to capture the behavior of the deduplication process, and allow revising and updating the modeled uncertainty. We apply our model and techniques to state-of-the-art deduplication algorithms to demonstrate the added value of our methods. Our experimental study evaluates the complexity and scalability of our techniques in different configurations.

Date March 31, 2008, updated July 17, 2008
CS-2008-08
Title Reductions of Graph Isomorphism Problems
Authors Margareta Ackerman
Abstract

We present a reduction between subgraph isomorphism and minimal partial subgraph isomorphism. We also provide a new reduction between graph isomorphism and minimal partial graph isomorphism. The new reduction is more efficient and simpler than the previous reduction and can be generalized to subgraph isomorphism. In addition, we show that a reduction from graph isomorphism that makes only one call to an oracle that finds a minimal partial isomorphism exists only if graph isomorphism is in P. We also provide a generalization of the result.

Date March 31, 2008
CS-2008-09
Title Painterly Rendering
Authors Lesley Northam
Abstract

Hertzmann's painterly rendering algorithm is re-visited and a modification is presented that models a painting technique commonly used by artists. This modification preserves gradients and reduces the random "popping" of strokes in animated scenes.

Date April 4 , 2008
CS-2008-10
Title Constraint-Based Typing for ML via Semiunification
Authors Brad Lushman and Gordon V. Cormack
Abstract

We characterize ML type inference as a constraint satisfaction based on a new generalization of the semiunification problem, based on a new class of
constraint variables that we call \emph{unknowns} (we call the semiunification problem with unknowns \emph{USUP}). Unlike previous characterizations based on ordinary semiunification, ours maintains a one-to-one correspondence between terms in the source language and constraints in the problem instance. This correspondence promises to facilitate reasoning about types by backward propagation of information back to the original source program---for example, unsolvable constraint sets could be directly translated back into specific type error information. Previous formulations of ML type inference in terms of semiunification have been less direct, thereby making practical adoption more difficult. We show that our syntax-directed formulation of ML type inference in terms of USUP is sound and complete. We give a solution semiprocedure for USUP, and show that the USUP image of any ML program is decidable. Through these contributions we wish to establish semiunification as a viable alternative to existing constraint-based typing systems for ML.

Date April 7 , 2008
CS-2008-11
Title A Uniform Approach Towards Succinct Representation of Trees
Authors Arash Farzan and J. Ian Munro
Abstract

We propose a uniform approach for succinct representation of various families of trees. The method is based on two recursive decomposition of trees into subtrees. Recursive decomposition of a structure into substructures is a common technique in succinct data structures and has already been shown fruitful in succinct representation of ordinal trees and dynamic binary trees. We take an approach that simplifies the existing representation of ordinal trees allowing the same full set of navigational operations. The approach applied to cardinal (ie k-ary) trees yields a space-optimal succinct representation which allows full set of cardinal-type operations (eg determining child labeled i) as well as the full set of ordinal-type operations (eg reporting subtree size). Existing space-optimal succinct representations had not been able to support both types of operations.

We demonstrate how the approach can be applied to obtain a space-optimal succinct representation for the family of free trees where the order of children is insignificant. Furthermore, we show that our approach can be used to obtain entropy-based succinct representations. We show that our approach matches the degree-distribution entropy suggested by Jansson. We discuss that our approach can be made adaptive to various other entropy measures."

Date May 6, 2008
CS-2008-12
Title Medical Image Segmentation using Level Sets
Authors Tenn Francis Chen
Abstract

Segmentation is a vital aspect of medical imaging. It aids in the visualization of medical data and diagnostics of various dieses. This report presents an implementation of a level set approach for active contour image segmentation. This method is originally developed by Osher and Sethian and then applied to image segmentation by Malladi. No requirements about objects’ shape and allowance for very flexible topology change are key advantages for this method.

Date May 8, 2008
CS-2008-13
Title Mitigating GPS Error in Mobile Environments
Abstract

We study the problem of accurately determining the position of a mobile device using imprecise off-the-shelf GPS devices. Existing work has not yet explored the impact of mobility on GPS error for off-the-shelf devices. We demonstrate that a naive use of GPS in mobile situations can lead to significant errors. Based on both stationary and mobile vehicular experiments, we identify four distinct sources of measurement error, resulting in a surprising 49.0 m of error while traveling at highway speeds of 100 km/h. We quantify the error into systematic error (44.7 m) and random error (4.3 m). We also show how simple techniques can significantly reduce systematic error, yielding up to a ten-fold reduction of error in mobile GPS measurements. We find that mobile error is primarily caused by misreporting when position measurements were taken and infrequent measurements. After correcting for these errors, off-the-shelf devices can be used for accurate position measurement while mobile.

Date July 14, 2008
CS-2008-14
Title Adaptive two-dimensional string matching for protein contact maps
Authors Robert Fraser and Alejandro Lopez-Ortiz
Abstract

Contact maps are two dimensional abstract representations of protein structures. One of the uses of contact maps is for the identi¯ca- tion of patterns which correspond to some known con¯guration of protein secondary structures. In the past, searching for these patterns has gener- ally used a naÄ³ve sliding window approach which is time consuming. We study several approaches that have been used for two dimensional string matching to accelerate the time requirements of these searching opera- tions, and demonstrate experimentally the e±cacy of these algorithms in our domain. Finally, we present an adaptive analysis of the problem, by restricting the search to only those regions of secondary structure that we are interested in. The same searches as performed previously are ex- ecuted again using the adaptive approach, and the improvements make the search tractable.

Date July 17 , 2008
CS-2008-15
Title Efficient Index-based Processing of Join Queries in DHTs
Authors Qiang Wang, Reza Akbarinia, Tamer Ozsu
Abstract

Massively distributed applications require the integration of heterogeneous data from multiple sources. Peer-to-peer (P2P) is one possible network model for these distributed applications and among P2P architectures, distributed hash table (DHT) is well known for its routing performance guarantees. Under a general distributed relational data model, join query operator, an essential component to integrate data from multiple relational tables in centralized DBMS, can be realized over DHT to support data integration tasks in P2P networks.

In this paper, we propose an efficient and adaptive index-based join query operator over DHTs. With attribute-value storage approach, we build decentralized join indices over DHT, facilitating join query processing with reduced bandwidth consumption. Join index information regarding each join query operator is maintained across multiple indexing peers via an adaptive scheme based on peer capacities, alleviating load-balancing problem. Moreover, we develop an algorithm to access distributed indices with proven performance guarantees. Based on the join indices, a semi-join-alike approach is exploited to handle join query processing at indexing peers concurrently, effectively realizing intra-operator parallelism and decreasing query processing latency. Through theoretic analysis and extensive simulation, we demonstrate the effectiveness and efficiency of our approach.

Date July 22, 2008
CS-2008-16
Title Critical Path Heuristic for Automatic Parallelization
Authors Omer Beg
Abstract

This paper presents an algorithm to heuristically partition the data dependence graph representing a basic block or a hyper block of code in order to assign instructions to available processing cores. The algorithm relies on the observation that the critical path in the data dependence graph defines the lower bound on the schedule length of a block and hence the instructions on the critical path should be assigned to the same core.
The algorithm has been implemented in the Trimaran compiler framework. The given heuristic algorithm is much simpler and evaluation shows that it is as good as previous state-of-the-art heuristic algorithms.

Date August 7, 2008
CS-2008-17
Title FLECS: A Data-Driven Framework for Rapid Protocol Prototyping
Authors Mirza Beg, Martin Karsten, and Srinivasan Keshav
Abstract Flecs is a framework for facilitating rapid implementation of packet forwarding protocols. Forwarding functionality of communication protocols can be modeled by a combination of packet processing components called abstract switching elements or Ases. Each Ase is constrained by the axioms of communication which enables us to formally analyze forwarding mechanisms in communication networks. Ases can be connected in a directed graph to define complex forwarding functionality. In this paper we present Flecs, a framework that compiles meta language protocol specification into its Click implementation. It allows rapid prototyping through configuration, as well as specialized implementation of performance-critical functionality through inheritance.
Date September 16, 2008
CS-2008-18
Title Evasive Attack on Stateful Signature-based Network Intrusion Detection Systems
Authors Tung Tran, Issam Aib, Ehab Al-Shaer, and Raouf Boutaba
Abstract

Network Intrusion Detection Systems (NIDS) have a very important role in network security. Many NIDS evasion techniques as well as solutions were proposed in the literature. Supporting stateful signatures is a very critical function in a signature-based NIDS because many multi-stage attacks can only be detected by tracking multiple rules (signatures) matching. In order to detect these attacks, the session state corresponding to an attack is normally simulated in a NIDS. However, due to the application protocol complication and overheads, it is impossible to have a complete simulation of the session state in a NIDS. Many evasion techniques exploit this Achilles’ heel of a NIDS: it does not have a complete image of the actual system it protects, which leads to the situation that the NIDS and its protected system see and explain things differently; therefore, the NIDS is possibly evaded by creative attacks. In this paper, we propose an evasion technique to Snort rule sets using flowbits, making it more susceptible to false negatives. Moreover, we also suggest practical solutions with controllable false positives to prevent the proposed attack and formally proved that the solutions are complete and sound. We implemented a tool called SFET which can automatically parse a rule set, generate evasion sequences against the rule set and produce a patch to the rule set. A significant number of publicly available rule sets are found vulnerable to the proposed attack, which seriously affects the security of Snort users’ systems. Although our proposed attack applies to Snort, the idea can be applicable to any NIDS supporting stateful signatures.

Date September 18, 2008
CS-2008-19
Title Distributed Quality-Lifetime Maximization in Wireless Video Sensor Networks
Authors Eren Gurses, Yuan Lin, Raouf Boutaba
Abstract

Abstract—Owing to the availability of low-cost and low-power CMOS cameras, Wireless Video Sensor Networks (WVSN) has recently become a reality. However video encoding is still a costly process for energy and capacity constrained sensor nodes and this urges the vitality of the control over the network lifetime. In this paper we propose a distributed quality-lifetime control algorithm where quality is simply measured by the visual signal quality. In order to formulate the quality-lifetime problem, we consider the Power-Rate-Distortion (P-R-D) model of the video encoder together with the rate control, medium access and routing functions of the underlying communication protocol and formulate it as a Generalized Network Utility Maximization (GNUM) problem. Then we construct the distributed solution based on duality and proximal point methods with necessary convergence analysis. Simulation results support that optimal quality-lifetime control is possible through the distributed algorithm, where the desired point of operation is simply adjusted by the operator via a configuration parameter

Date September 24, 2008
CS-2008-20
Title A Study on the Appearance of Quartz
Authors Lesley Northam , Gladimir V. G. Baranoski
Abstract

The purpose of this report is to gather mineralogical and physical information relevant for future research aimed at the modeling of the appearance of quartz. Initially, we review the basic characteristics of minerals, with an emphasis on aspects that can be applied in the representation of different types of quartz. We then examine some of the key optical properties of quartz, such as light reflection and transmission, that affect its appearance. Finally, we provide a concise survey of previous works on the simulation of minerals, which can provide practical insights for future efforts towards the photorealistic rendering of quartz.

Date October 8, 2008
CS-2008-21
Title Robust and Scalable Trust Management for Collaborative Intrusion Detection
Authors Carol J Fung Jie Zhang Issam Aib Raouf Boutaba
Abstract

Abstract—The accuracy of detecting intrusions within an Intrusion Detection Network (IDN) depends on the efficiency of collaboration between the peer Intrusion Detection Systems (IDSes) as well as the security itself of the IDN against insider threats. In this paper, we study host-based IDNs and introduce a Dirichlet-based model to measure the level of trustworthiness among peer IDSes according to their mutual experience. The model has strong scalability properties and is robust against common insider threats, such as a compromised or malfunctioning peer. We evaluate our system based on a simulated collaborative host-based IDS network. The experimental results demonstrate the improved robustness, efficiency, and scalability of our system in detecting intrusions in comparison with existing models.

Date September 29, 2008
CS-2008-22
Title An Efficient Storeless Heap Abstraction Using SSA Form
Authors Nomair A. Naeem, Ondrej Lhotak
Abstract Precise, flow-sensitive analyses of pointer relationships often use a storeless heap abstraction. In this model, an object is represented using some abstraction of the expressions that refer to it (i.e. access paths). Many analyses using such an abstraction are difficult to scale due to the size of the abstraction and due to flow sensitivity. Typically, an object is represented by the set of local variables pointing to it, together with additional predicates representing pointers from other objects. The focus of this paper is on the set of local variables, the core of any such abstraction. Taking advantage of certain properties of static single assignment (SSA) form, we propose an efficient data structure that allows much of the representation of an object at different points in the program to be shared. The transfer function for each statement, instead of creating an updated set, makes only local changes to the existing data structure representing the set. The key enabling properties of SSA form are that every point at which a variable is live is dominated by its definition, and that the definitions of any set of simultaneously live variables are totally ordered according to the dominance relation. We represent the variables pointing to an object using a list ordered consistently with the dominance relation. Thus, when a variable is newly defined to point to the object, it need only be added to the head of the list. A back edge at which some variables cease to be live requires only dropping variables from the head of the list. We prove that the analysis using the proposed data structure computes the same result as a set-based analysis. We empirically show that the proposed data structure is more efficient in both time and memory requirements than set implementations using hash tables and balanced trees.
Date October 14 , 2008
CS-2008-23
Title Creating Artistic Compositions Using Coalition-Forming Intelligent Agents
Authors Alex Pytel
Abstract Simulation of multi-agent systems (MAS) can be an effective technique for generating randomized compositions out of primitive picture elements. This report presents a complete implementation of a MAS for creating rule-based hierarchical arrangements of 3D objects in a plane and identifes some associated technical challenges, such as attaining good performance and usability.

Date October 14 , 2008

CS-2008-24

Title Policy-Based Security Configuration Management Application to Intrusion Detection and Prevention
Authors Issam Aib, Khalid Alsubhi, Jerome Francois, and Raouf Boutaba
Abstract Intrusion Detection and/or Prevention Systems (IDPS) represent an important line of defense against the variety of attacks that can compromise the security and well functioning of an enterprise information system. IDPSes can be network or host-based and can collaborate in order to provide better detections of malicious traffic. Although several IDPS systems have been proposed, their appropriate configuration and control for effective detection and prevention of attacks has always been far from trivial. Another concern is related to the slowing down of system performance when maximum security is applied, hence the need to trade off between security enforcement levels and the performance and usability of an enterprise information system.
In this paper we motivate the need for and present a policy-based framework for the configuration and control of the security enforcement mechanisms of an enterprise information system. The approach is based on dynamic adaptation of security measures based on the assessment of system vulnerability and threat prediction and provides several levels of attack containment. As an application, we have implemented a dynamic policy-based adaptation mechanism between the Snort signature-based IDPS and the light weight anomaly-based FireCol IDS. Experiments conducted over the DARPA 2000 and 1999 intrusion detection evaluation datasets show the viability of our framework1.

Date October 22, 2008
CS-2008-25
Title A Survey of Network Virtualization
Authors N.M. Mosharaf Kabir Chowdhury and Raouf Boutaba
Abstract

Due to the existence of multiple stakeholders with conflicting goals and policies, alterations to the existing Internet are now limited to simple incremental updates; deployment of any new, radically different technology is next to impossible. To fend off this ossification once and for all, network virtualization has been propounded as a diversifying attribute of the future inter-networking paradigm. By allowing multiple heterogeneous network architectures to cohabit on a shared physical substrate, network virtualization provides flexibility, promotes diversity, and promises security and increased manageability. In this paper, we present a network virtualization model with a set of quintessential design goals, survey the past and the state-of-the-art of network virtualization, and discuss the future challenges that must be addressed to realize a viable network virtualization model.

Date October 15, 2008
CS-2008-26
Title Cauchy’s theorem for orthogonal polyhedra of genus 0
Authors Therese Biedl, Burkay Genc
Abstract A famous theorem by Cauchy states that a convex polyhedron is determined by its incidence structure and face-polygons alone. In this paper, we prove the same for orthogonal polyhedra of genus 0 as long as no face has a hole. Our proof yields a linear-time algorithm to find the dihedral angles.
Date November 14, 2008
CS-2008-27
Title Edge-intersection graphs of k-bend paths in grids
Authors Therese Biedl and Michal Stern
Abstract In this paper, we continue the study of edge-intersection graphs of paths in a grid, which was initiated by Golumbic, Lipshteyn and Stern. We show that for any $k$, if the number of bends in each path is restricted to be at most $k$, then not all graphs can be represented. Then we study some graph classes that can be represented with $k$-bend paths, for small values of $k$.
Date November 14, 2008
CS-2008-28
Title Checking Inevitability and Invariance Using Description Logic Technology
Authors Shoham Ben-David, Richard Trefler, Dmitry Tsarkov and Grant Weddell
Abstract Description Logic is a family of knowledge representation formalisms, mainly used to specify ontologies for information systems. We show how Description Logic can serve as a natural setting for representing and solving symbolic model checking problems. We concentrate on inevitability (AF(p)) specification and invariance (AG(p)) formulas. Experimental results, using the Description Logic reasoner FaCT++, outperform existing methods for inevitability formulas. For invariance formulas we give new encodings that significantly improve on previous implementations using DL technology.
Date November 17 , 2008