CS200601  

Title  SIWF 2Q: WF2Q Approximation with Small Constant Execution Overhead  Extended Version  
Authors  Martin Karsten  
Abstract  This paper presents a novel priority queue data structure and access operations for timer maintenance in the context of traffic shaping and scheduling in packet networks. The data structure and operations are used to construct an efficient General Processor Sharing (GPS) approximation scheduler. In contrast to existing proposals, this scheduler has constant and nearoptimal delay and fairness properties, /and/ can be implemented with bounded amortized O(1) algorithmic complexity, /and/ has very low absolute execution overhead. The paper presents the data structure, the scheduling algorithm, and studies its execution complexity. The scheduling properties are analyzed and shown to relate nicely to existing work. Some illustrative simulation results are presented to reaffirm those properties.  
Date  January 2006  
Report  SIWF 2Q: WF2Q Approximation with Small Constant Execution Overhead  Extended Version (PDF) 
CS200602  

Title  An Investigation on Sieve and Detour Effects Affecting the Interaction of Infrared Radiation with Plant Leaves  
Authors  Gladimir V. G. Baranoski and Denise Eng  
Abstract  The retrieval of plant biophysical and biochemical properties from high spectral resolution data represents an active area of research within the remote sensing field. Scientific studies in this area are usually supported by computational simulations of light attenuation processes within foliar tissues. In heterogeneous organic materials, like plant leaves, sieve and detour effects can affect these processes, and ultimately change the light gradients within these tissues and their spectral signatures. Although these effects have been extensively examined for applications involving the interactions of visible radiation with plant leaves, little is known about their role in the infrared domain. In this paper, we describe the procedural basis for their incorporation in the modelling of infrared radiation transport (in the range of 750 to 2500nm) within plant leaves. We also assess their impact on the predictability of simulation solutions relating the directionality of the incident radiation and the internal arrangement of the tissues to changes on foliar spectral signatures in this domain. Our investigation is grounded by observations involving modeled results and quantitative and qualitative data reported in the literature  
Date  November 2, 2006  
Report  An Investigation on Sieve and Detour Effects Affecting the Interaction of Infrared Radiation with Plant Leaves (PDF) 
CS200603  

Title  Optimal lower bounds for rank and select indexes  
Authors  Alexander Golynski  
Abstract  We develop a new lower bound technique for data structures. We show an optimal $\Omega(n \lg\lg n / \lg n)$ space lower bounds for storing an index that allows to implement rank and select queries on a bit vector $B$ provided that $B$ is stored explicitly. These results improve upon [Miltersen, SODA'05]. We show $\Omega((m/t) \lg t)$ lower bounds for storing rank/select index in the case where $B$ has $m$ $1$bits in it (e.g. low $0$th entropy) and the algorithm is allowed to probe $t$ bits of $B$. We simplify select index given in [Raman et al., SODA'02] and show how to implement both rank and select queries with an index of size $(1 + o(1)) (n \lg\lg n / \lg n) + O(n / \lg n)$ (i.e. we give an explicit constant for storage) in the RAM model with word size $\lg n$.  
Date  February 10, 2006  
Report  Optimal lower bounds for rank and select indexes (PDF) 
CS200604  

Title  Comparisions on Different Approachs to Assign Missing Attribute Values  
Authors  Jiye Li and Nick Cercone  
Abstract  A commonlyused and naive solution to process data with missing attribute values is to ignore the instances which contain missing attribute values. This method may neglect important information within the data, significant amount of data could be easily discarded, and the discovered knowledge may not contain significant rules. Some methods, such as assigning the most common values or assigning an average value to the missing attribute, may make good use of all the available data. However the assigned value may not come from the information which the data originally derived, thus noise is brought to the data. We introduce a new approach RSFit on processing data with missing attribute values based on rough sets theory. By matching attributevalue pairs among the same core or reduct of the original data set, the assigned value preserves the characteristics of the original data set. We compare our approach with "closest fit approach globally" and "closest fit approach in the same concept". Experimental results on UCI data sets and a real geriatric care data set show our approach achieves comparable accuracy on assigning the missing values while significantly reduces the computation time.  
Date  January 2006  
Report  Comparisions on Different Approachs to Assign Missing Attribute Values (PDF) 
CS200605  

Title  A Scalable Peertopeer Protocol Enabling Efficient and Flexible Search  
Authors  Reaz Ahmed and Raouf Boutaba  
Abstract 
Efficient
discovery
of
information,
based
on
partial
knowledge,
is
a
challenging
problem
faced
by
many
large
scale
distributed
systems.
This
paper
presents
a
peertopeer
search
protocol
that
addresses
this
problem.
The
proposed
system
provides
an
efficient
mechanism
for
advertising
a
binary
pattern,
and
discovering
it
using
any
subset
of
its
1bits.
A
pattern
(e.g.,
Bloom
filter)
summarizes
the
properties
(e.g.,
keywords
or
service
description)
associated
with
a
shared
object
(e.g.,
document
or
service).
The proposed system has a partially decentralized architecture involving superpeers and adopts a novel structured search mechanism derived from the theory of Error Correcting Codes (ECC). Better resilience to peer failure is achieved by utilizing replication and redundant routing paths. The number of routing hops and the number of links maintained by each superpeer scales logarithmically with the number of superpeers. The concept presented in this paper is supported with theoretical analysis, and simulation results.  
Date  January 2006  
Report  A Scalable Peertopeer Protocol Enabling Efficient and Flexible Search (PDF) 
CS200606  

Title  The RAcyclic Semiunification Problem  
Authors  Brad Lushman, Gordon V. Cormack  
Abstract  We recast Kfoury and Wells' formulation of the acyclic semiunification problem (ASUP) in graph theoretic terms and prove equivalence between the two formulations. We then relax and simplify the graphtheoretic formulation; we call the resulting problem the Racyclic semiunification problem (RASUP), which we show to be a strict superset of ASUP. We prove that the ASUP solution procedure terminates and produces most general solutions for RASUP (and hence for ASUP) in the same sense as Robinson's unification algorithm. We thus extend the class of semiunification instances known to be decidable.  
Date  March 13, 2006  
Report  The RAcyclic Semiunification Problem (PDF) 
CS200607  

Title  FIX: Featurebased Indexing Technique for XML Documents  
Authors  Ning Zhang, M. Tamer Ozsu, Ihab F. Ilyas, and Ashraf Aboulnaga  
Abstract  In this paper, we study the problem of indexing an XML database. Existing XML indexing techniques focus on clustering methods based on the combinatorial structural properties of an XML document. These techniques cluster tree nodes into an index tree or graph based on their similarities in ancestordescendant or sibling relationships. Index lookup then amounts to pattern matching on the clustered tree or graph. In this paper, we propose a featurebased indexing technique, called FIX, based on the spectral graph theory. The basic idea is that for each twig pattern in a collection of XML documents, we calculate a vector of features based on its structural properties. These features are used as a key for the patterns and stored in a Btree or a multidimensional index tree. Given an XPath query, its feature vector is first calculated and looked up in the index. Then a further refinement phase is performed to fetch the final results. We experimentally study the indexing technique over two scenarios: a large collection of relatively smaller documents, and a single large document. Our experiments show that FIX provides great pruning power and could gain an order of magnitude performance improvement for many XPath queries over existing evaluation techniques.  
Date  March 14, 2006  
Report  FIX: Featurebased Indexing Technique for XML Documents (PDF) 
CS200608  

Title 
A
<I>More</I>
Direct
Algorithm
for
Type
Inference
in
the
Rank2
Fragment of the SecondOrder LambdaCalculus  
Authors  Brad Lushman, Gordon V. Cormack  
Abstract 
We present an algorithm for rank2 type inference in the secondorder $\lambda$calculus. Our algorithm differs from the wellknown algorithm of Kfoury and Wells in that it employs only a quadratically fewer type variables and inequalities. Our algorithm consists of a translation from a $\lambda$term to an instance of $R$ASUP (a decidable superset of ASUP) in which the variables correspond more directly to features in the original term. We claim that our construction, being simpler and more direct, is more amenable to proof and extension.  
Date  March 29, 2006/Revised July 27, 2007  
Report  A <I>More</I> Direct Algorithm for Type Inference in the Rank2 Fragment (PDF) 
CS200609  

Title  Robust Numerical Valuation of European and American Options under the CGMY Process  
Authors  Iris R. Wang, Justin W.L. Wan and Peter A. Forsyth.  
Abstract  We develop an implicit discretization method for pricing European and American options when the underlying asset is driven by an infinite activity Levy process. For processes of finite variation, quadratic convergence is obtained as the mesh and time step are refined. For infinite variation processes, better than first order accuracy is achieved. The jump component in the neighbourhood of log jump size zero is specially treated by using a Taylor expansion approximation and the advection term is dealt with using a semiLagrangian scheme. The resulting Partial IntegroDifferential Equation (PIDE) is then solved using preconditioned BiCGSTAB method coupled with a fast Fourier transform. Proofs of fully implicit timestepping stability and monotonicity are provided. The convergence properties of the BiCGSTAB scheme are discussed and compared with a fixed point iteration. Numerical tests on convergence and performance of European and American options under processes of finite and infinite variation are presented.  
Date  April 7, 2006  
Report  Robust Numerical Valuation of European and American Options under the CGMY Process (PDF) 
CS200610  

Title  Succinct Encodings for XPath Location Steps  
Authors  JC)rC)my Barbay and S. Srinivasa Rao  
Abstract  We consider in this paper the problem of encoding XML documents in small space while still supporting XPath Location steps efficiently. We model XML documents as multilabeled trees, and propose for those an encoding which takes space close to the lower bound suggested by information theory, while still supporting the search for the ancestors, descendants and children matching a given label efficiently. To achieve this goal, we extend the previous results from Golynski et al. on strings over large alphabets, and from Barbay et al. on binary relations, to support the rank and select operators in several orders at once; and we extend the results from Geary et al. on ordinal trees to support the DFUDS order, in addition to the other operators.  
Report  Succinct Encodings for XPath Location Steps (PDF) 
CS200611  

Title  Adaptive Search Algorithm for Patterns, in Succinctly Encoded XML  
Authors  JC)rC)my Barbay  
Abstract  We propose an adaptive algorithm for context queries (queries expressed as preorder and ancestordescendant relations on labeled nodes), which can be used to find patterns in XML documents. Our algorithm takes advantage of the correlation between terms of the query without any preprocessed information, and it runs in time (kd(lg lg min(n,s)+lg lg(r))) in the RAM model, where k is the number of terms in the query, d is the nondeterministic complexity of the query on the multilabeled tree (i.e. the minimum number of operations required to check the answer to the query), n is the number of nodes in the tree, s is the number of relations between nodes and labels, and r is the maximal number of nodes matching a label on any rooted path in the tree.  
Date  April 21, 2006  
Report  Adaptive Search Algorithm for Patterns, in Succinctly Encoded XML (PDF) 
CS200612  

Title  Age and Geographic Inferences of the LiveJournal Social Network  
Authors  Ian McKinnon, Rob Warren  
Abstract  Online social networks are often a byproduct of blogging and other online media sites on the Internet. Services such as LiveJournal allow their users to specify who their "friends" are, and thus a social network is formed. This paper will explore the relationship between users with the intent of being able to make a prediction of a users age and country of residence based on the information given by their friends on this social network.  
Date  June, 2006  
Report  Age and Geographic Inferences of the LiveJournal Social Network (PDF) 
CS200613  

Title  Predicting Missing Attribute Values based on Frequent Itemset and RSFit  
Authors  Jiye Li and Nick Cercone  
Abstract  How to process missing attribute values is an important data preprocessing problem in data mining and knowledge discovery tasks. A commonlyused and naive solution to process data with missing attribute values is to ignore the instances which contain missing attribute values. This method may neglect important information within the data and a significant amount of data could be easily discarded. Some methods, such as assigning the most common values or assigning an average value to the missing attribute, make good use of all the available data. However the assigned value may not come from the information which the data originally derived from, thus noise is brought to the data. We introduce an integrated approach ItemRSFit to effectively predict missing attribute values by combining frequent itemset and RSFit approaches together. Frequent itemset is generated from the association rules algorithm and it displays the correlations between different items in a transaction data set. Using frequent itemset as a knowledge base to predict missing attribute values is shown to have a high prediction accuracy. However this approach alone cannot predict all the existing missing attributes. RSFit is a newly developed approach to predict missing attribute values based on the similarities of attributevalue pairs by only considering attributes contained in the core or the reduct of the data set. The RSFit approach provides a faster prediction and can be used for the cases that are not covered by the itemset approach. Empirical studies on UCI data sets and a real world data set demonstrate a significant increase of predicting accuracy obtained from this new integrated approach.  
Date  April 25, 2006  
Report  Predicting Missing Attribute Values based on Frequent Itemset and RSFit (PDF) 
CS200614  

Title  Seeking Empirical Evidence for SelfOrganized Criticality in Open Source Software Evolution  
Authors  Jingwei Wu, Richard C. Holt  
Abstract  We examine 11 open source software systems and present empirical evidence for the existence of fractal structures in software evolution. In our study, fractal structures are measured as power laws through the lifetime of a software system. We describe two specific power law related phenomena: the probability distribution of software changes decreases as a power function of change sizes; and the time series of software change exhibits long range correlations with power law behavior. The existence of such spatial (across the system) and temporal (over the system lifetime) power laws suggests that SelfOrganized Criticality (SOC) occurs in the evolution of open source systems. As a consequence, SOC may be useful as a conceptual framework for understanding software evolution dynamics (the cause and mechanism of change or growth). We give a qualitative explanation of software evolution based on SOC. We also discuss some implications of SOC to current software practices.  
Date  April 26, 2006  
Report  Seeking Empirical Evidence for SelfOrganized Criticality in Open Source Software Evolution (PDF) 
CS200615  

Title  Collaborative and Coordinated Product Configuration  
Authors  Marcilio Mendonca, Toacy Oliveira, Donald Cowan  
Abstract  Product configuration is a key activity of product engineering that regards the constrained combination and parameterization of product line assets as a means to achieve correct software specification. Current product configuration approaches frequently rely on the application engineer to translate user requirements into correct configuration choices. This process is errorprone and risky as requirements may lead to conflicting decisions at configuration time. Indeed, we deem that an important aspect of product configuration has long been neglected: its collaborative nature. In our research, we advocate that product configuration is enhanced by a collaborative perspective, providing that conflicting scenarios are properly handled. We propose an approach to support collaborative and coordinated product configuration by promoting processes to firstorder elements for the explicit guidance of configuration decisions. We provide insights on important coordination issues and introduce an algorithm to derive process models from annotated feature models to illustrate the approach's feasibility. Keywords: Product Configuration, Software Processes, Software Product Lines, Collaborative Software Configuration.  
Date  May 17, 2006  
Report  Collaborative and Coordinated Product Configuration (PDF) 
CS200616  

Title  Geometrical Mesh Improvement Properties of Delaunay Terminal Edge Refinement  
Authors 
Bruce
Simpson,
David
Cheriton
School
of
Computer
Science,
University
of
Waterloo,
Waterloo,
Ontario,
Canada,
N2L
3G1 and MariaCecilia Rivars, Department of Computer Science, Universidad de Chile, Blanco Encalada 2120, Santiago, Chile  
Abstract  "The use of edge based refinement in general, and Delaunay terminal edge refinement in particular are well established for planar meshing, but largely on a heuristic basis. In this paper, we present a series of theoretical results on the geometric mesh improvement properties of these methods. The discussion is based on refining a mesh to meet a specified angle tolerance."  
Date  July 25, 2006  
Report  Geometrical Mesh Improvement Properties of Delaunay Terminal Edge Refinement (PDF) 
CS200617  

Title  A Program Extractor Suite for C and C++: Choosing the Right Tool for the Job  
Authors  Jingwei Wu, Richard C. Holt  
Abstract  This report describes a suite of program extractors, which we developed by adopting and extending existing program parsing and extraction techniques or tools. This suite is called CX because it is mainly targeted at extracting facts from C and C++ programs. This suite is currently composed of four extractors: CPPX, BFX, LDX and CTSX. The main goal of creating CX is to provide a convenient set of program extractors, which complement each other and work in a systematic manner. The benefits of this extractor suite will be discussed in terms of two practical applications: (1) creating program comprehension pipelines to support various understanding tasks, and (2) building an open source software evolution database (EvolDB) to support empirical research on software evolution.  
Date  June 3, 2006  
Report  A Program Extractor Suite for C and C++: Choosing the Right Tool for the Job (PDF) 
CS200618  

Title  A Survey of Data Management in PeertoPeer Systems  
Authors  Rolando Blanco, Nabeel Ahmed, David Hadaller, L. G. Alex Sung, Herman Li, and Mohamed Ali Soliman  
Abstract  PeertoPeer (P2P) systems provide a decentralized infrastructure for resource sharing. In particular, file sharing was the initial motivation behind many of the first successful P2P systems. As P2P systems evolve to support sharing of structured and semantically rich data, several data management issues must be addressed. Specifically, P2P systems need to deal with data location, data integration, data querying, and data consistency issues. Data management is further complicated in P2P systems due to the volatility and scale of these systems. In this survey we propose a reference architecture for P2P systems that focuses on the data management activities required to support sharing of structured data. Based on this reference architecture, we present and classify technologies that have been proposed to solve the data management issues particular to P2P systems.  
Date  June 20 , 2006  
Report  A Survey of Data Management in PeertoPeer Systems (PDF) 
CS200619  

Title  Incremental Maintenance of Global Aggregates  
Authors  Nabeel Ahmed, David Hadaller, and Srinivasan Keshav  
Abstract  Providing local access to global information can improve the efficiency of many distributed applications. Examples include applications that aggregate sensor values, search in PeertoPeer systems, or perform TopK queries in streamoriented databases. Efficient computation of such aggregates is difficult due to the massive scale and dynamics of such systems and has led to the proposal of several approximate techniques based on randomized gossip algorithms. However, maintenance of such aggregates has not been adequately addressed in the literature. Changes in node state, therefore, require a full and expensive recomputation of the global aggregate. This paper makes three contributions to this field. First, we propose a variant of the wellknown FM aggregation scheme that allows us to support incremental maintenance of aggregates. Second, we propose the concept of significance thresholds and illustrate their benefits. Finally, we present a detailed performance evaluation of our techniques and find that we can reduce computation time by 60% compared to recomputing the aggregate, as is traditionally done.  
Date  June 21, 2006  
Report  Incremental Maintenance of Global Aggregates (PDF) 
CS200620  

Title  "On Pattern Expression Languages"  
Authors  Cezar Campeanu and Nicolae Santean  
Abstract 
In
this
paper
we
show
that
the
family
of
pattern
expression
languages
is
closed
under
the
intersection
with
regular
languages.
Since
this
family
is
not
closed
under
complement
but
is
closed
under
reverse,
a
natural
question
arises,
that
is,
whether
particular
languages
such
as
those
containing
words
of
type
$ww^R$
are
pattern
expression
languages
or
not.
We
give
a
proof
for
a
negative
answer
to
this
question,
and
we
provide
several
examples
of
languages
which
can
not
be
specified
by
pattern
expressions.
Keywords: pattern, pattern automata, pattern expressions, regex, regular expressions  
Date  December 11, 2006  
Report  "On Pattern Expression Languages" (PDF) 
CS200621  

Title  A Robust Overlay for Distributed Range Query Processing  
Authors 
AndrC)
Allavena
Qiang
Wang
Ihab
Ilyas
Srinivasan
Keshav University of Waterloo {aallavena, q6wang, ilyas, keshav}@uwaterloo.ca  
Abstract  Largescale datacentric services are often handled by clusters of computers that include hundreds of thousands of computing nodes. However, traditional distributed query processing techniques fail to handle the largescale distribution, peertopeer communication and frequent disconnection. In this paper, we introduce LOT, a robust, faulttolerant and highly distributed overlay network for largescale peertopeer query processing. LOT is based on a robust tree overlay for distributed systems. It uses virtualization, replication, geographicbased clustering and flexible state definition as basic design principles. We show how we map these principles to desirable performance goals. Moreover, we provide a lightweight maintenance mechanism for updating state information. Analysis and simulations show that our approach is superior to other wellknown alternatives in its query processing performance and handling of churn.  
Date  July 14, 2006  
Report  A Robust Overlay for Distributed Range Query Processing (PDF) 
CS200622  

Title  Fast, Efficient and Robust InNetwork Computation  
Authors 
AndrC)
Allavena
and
Srinivasan
Keshav University of Waterloo {aallaven,keshav}@uwaterloo.ca}  
Abstract 
Today,
companies
such
as
eBay,
Amazon,
Google,
and
IBM
routinely
operate
clusters
with
more
than
10,000
servers
located
in
data
centres
around
the
world.
Developing
applications
that
efficiently
use
the
resources
of
such
a
distributed
cluster
is
challenging.
This
task
is
made
eased
by
a
middleware
layer
that
provides
application
programmers
with
the
illusion
of
dynamically
updated
"global
state".
Maintaining
global
state
can
be
viewed
as
repeatedly
computing
an
arbitrary
function
over
a
set
of
timevarying
values,
where
the
values
are
held
at
each
node,
and
every
node
needs
to
know
the
resultant
function.
Many
wellknown
problems
in
distributed
systems,
such
as
load
balancing,
leader
election,
barrier
synchronisation,
distributed
file
system
maintenance,
and
coordinated
intrusion
detection
can
be
cast
in
this
form.
We present and rigorously analyse an algorithm that uses a a tree of virtual nodes to compute nearly arbitrary functions of global state. Our scheme is fast: running time is $\Theta(ln n)$. It is efficient: nodes exchange O(n ln n) messages. Most of these messages are within a data centre and therefore are relatively cheap. It is accurate: the computed value does not have inherent errors either due to doublecounting, as with standard gossip, or due to stochastic counting, as in the FlajoletMartin approach. Finally, it is fault tolerant: The algorithm fails with probability O( 1 / n^{c(1+r)} )$ where c is a constant and 0 <= r \leq (\ln n)^Cst a userchosen reliability parameter. We therefore believe that our work is the basis for building robust and efficient distributed systems in a variety of problem domains.  
Date  July 14, 2006  
Report  Fast, Efficient and Robust InNetwork Computation (PDF) 
CS200623  

Title  Weighted Queries on Binary Relations and MultiLabeled Trees  
Authors  Jeremy Barbay, Aleh Veraskouski  
Abstract  The common way to search large indexed databases is through conjunctive queries on binary relations, such as those associating keywords to web page references. We extend this model by adding positive weights to the terms of the query. In this context we give and analyze two algorithms to answer such queries in a pertinent way. We extend this approach to solve weighted queries on treestructured objects such as filesystems or XML documents.  
Date  July 27, 2006  
Report  Weighted Queries on Binary Relations and MultiLabeled Trees (PDF) 
CS200624  

Title  Succinct Indexes for Strings, Binary Relations and Multilabeled Trees  
Authors  Jeremy Barbay, Meng He, J. Ian Munro, and S. Srinivasa Rao  
Abstract 
We
define
and
design
succinct
indexes
for
several
abstract
data
types
(ADTs).
The
concept
is
to
design
auxiliary
data
structures
called
succinct
indexes
that
occupy
asymptotically
less
space
than
the
informationtheoretic
lower
bound
on
the
space
required
to
encode
the
given
data,
and
support
an
extended
set
of
operations
using
the
basic
operators
defined
in
the
ADT.
As opposed to succinct encodings, The main advantage of succinct indexes is that we make assumptions only on the ADT through which the main data is accessed, rather than the way in which the data is encoded. This allows more freedom in the encoding of the main data.In this paper, we present succinct indexes for various data types, namely strings, binary relations and multilabeled trees. Given the support for the interface of the ADTs of these data types, we can support various useful operations efficiently by constructing succinct indexes for them. When the operators in the ADTs are supported in constant time, our results are comparable to previous results, while allowing more flexibility in the encoding of the given data. Using our techniques, we design the first succinct encoding that represents a string of length $n$ over alphabet $[\sigma]$ using $nH_k+o(n\lg \sigma)$ bits \footnote{We use $\lg n$ to denote $\lceil \log_2 n \rceil$.} that support access / rank / select operations in $o((\lg\lg \sigma)3)$ time. We also design the first succinct text index using $nH_k+o(n\lg\sigma)$ bits that supports pattern matching queries in $O(m\lg\lg\sigma + occ\lg^{1+\epsilon}n\lg\lg\sigma)$ time, for a given pattern of length $m$. Previous results on these two problems either have a $\lg\sigma$ factor instead of $\lg\lg\sigma$ in terms of running time, or are not compressible, but our results do not have such problems.  
Date  July 27, 2006  
Report  Succinct Indexes for Strings, Binary Relations and Multilabeled Trees (PDF) 
CS200625  

Title  Topk Query Processing in Uncertain Databases  
Authors  Mohamed A. Soliman, Ihab F. Ilyas, Kevin ChenChuan Chang  
Abstract 
Topk processing in uncertain databases is semantically and computationally different from traditional topk processing. The interplay between score and uncertainty information makes traditional processing techniques inapplicable to uncertain databases. In this paper we introduce new probabilistic formulations for topk queries. Our formulations are based on marriage of traditional topk semantics with possible worlds semantics. In the light of these formulations, we construct a framework that encapsulates a novel probabilistic model and efficient query processing techniques to tackle the challenges raised by uncertain data settings. We prove that our techniques are optimal in terms of the number of accessed tuples. Our experiments show the efficiency of our techniques under different data distributions with orders of magnitude improvement over naive materialization of possible worlds.  
Date  August 2, 2006 /Updated August 14, 2007  
Report  Topk Query Processing in Uncertain Databases (PDF) 
CS200626  

Title  MultiQuery Optimization of Sliding Window Aggregates by Schedule Synchronization  
Authors  Lukasz Golab, Kumar Gaurav Bijay, and M. Tamer Ozsu  
Abstract 
Data
stream
systems
process
persistent
queries,
typically
posed
over
sliding
windows
and
reevaluated
periodically
as
the
windows
slide
forward.
Due
to
their
longrunning
nature,
a
number
of
similar
persistent
queries
may
run
in
parallel
at
any
given
time,
therefore
multiquery
optimization
is
particularly
important.
In
traditional
multiquery
optimization,
one
of
the
primary
goals
is
to
detect
common
parts
across
multiple
queries
issued
at
the
same
time
and
perform
the
common
task
only
once.
Another
related
goal
is
to
check
if
a
new
query
may
be
answered
using
the
answer
of
a
related
query
which
has
been
computed
previously
(and
is
stored
as
a
materialized
view).
In
this
paper,
we
argue
that
in
the
context
of
periodically
reevaluated
queries,
multiquery
optimization
requires
an
additional
step
beyond
common
subexpression
matching.
This
is
because
queries
that
have
been
identified
as
similar
may
be
reevaluated
with
different
frequencies
and
therefore
may
be
scheduled
at
different
times.
Thus,
the
additional
step
must
attempt
to
synchronize
the
reexecution
times
of
similar
queries
in
order
to
take
advantage
of
computation
sharing.
The solution presented in this paper focuses on periodicallyevaluated aggregates over sliding windows of various lengths, which are a common class of persistent queries used for monitoring purposes. The proposed solution assumes that users specify an upper bound on the interval between reevaluations of their queries and is based upon the following insight: it may be cheaper to reexecute some queries more often if their reexecution schedules can be synchronized with those of similar queries, thereby amortizing the computation costs. We also show that additional schedule synchronization is possible when the system is forced to lengthen the desired reexecution intervals during periods of overload. Our solutions are based upon a variant of the earliestdeadlinefirst algorithm that views persistent queries are periodic tasks. Experimental results show significant improvements in system throughput due to increased resource sharing.  
Date  August, 2006  
Report  MultiQuery Optimization of Sliding Window Aggregates by Schedule Synchronization (PDF) 
CS200627  

Title  Sliding Window Query Processing over Data Streams  
Authors  Lukasz Golab  
Abstract 
Database
management
systems
(DBMSs)
have
been
used
successfully
in
traditional
business
applications
that
require
persistent
data
storage
and
an
efficient
querying
mechanism.
Typically,
it
is
assumed
that
the
data
are
static,
unless
explicitly
modified
or
deleted
by
a
user
or
application.
Database
queries
are
executed
when
issued
and
their
answers
reflect
the
current
state
of
the
data.
However,
emerging
applications,
such
as
sensor
networks,
realtime
Internet
traffic
analysis,
and
online
financial
trading,
require
support
for
processing
of
unbounded
data
streams.
The
fundamental
assumption
of
a
data
stream
management
system
(DSMS)
is
that
new
data
are
generated
continually,
making
it
infeasible
to
store
a
stream
in
its
entirety.
At
best,
a
sliding
window
of
recently
arrived
data
may
be
maintained,
meaning
that
old
data
must
be
removed
as
time
goes
on.
Furthermore,
as
the
contents
of
the
sliding
windows
evolve
over
time,
it
makes
sense
for
users
to
ask
a
query
once
and
receive
updated
answers
over
time.
This dissertation begins with the observation that the two fundamental requirements of a DSMS are dealing with transient (timeevolving) rather than static data and answering persistent rather than transient queries. One implication of the first requirement is that data maintenance costs have a significant effect on the performance of a DSMS. Additionally, traditional query processing algorithms must be reengineered for the sliding window model because queries may need to reprocess expired data and "undo" previously generated results. The second requirement suggests that a DSMS may execute a large number of persistent queries at the same time, therefore there exist opportunities for resource sharing among similar queries. The purpose of this dissertation is to develop solutions for efficient query processing over sliding windows by focusing on these two fundamental properties. In terms of the transient nature of streaming data, this dissertation is based upon the following insight. Although the data keep changing over time as the windows slide forward, the changes are not random; on the contrary, the inputs and outputs of a DSMS exhibit patterns in the way the data are inserted and deleted. It will be shown that the knowledge of these patterns leads to an understanding of the semantics of persistent queries, lower window maintenance costs, as well as novel query processing, query optimization, and concurrency control strategies. In the context of the persistent nature of DSMS queries, the insight behind the proposed solution is that various queries may need to be refreshed at different times, therefore synchronizing the refresh schedules of similar queries creates more opportunities for resource sharing.  
Date  August, 2006  
Report  Sliding Window Query Processing over Data Streams (PDF) 
CS200628  

Title  "Deterministic Simulation of a NFA with ksymbol Lookahead"  
Authors  Bala Ravikumar and Nicolae Santean  
Abstract  We investigate deterministically simulating (i.e., solving the membership problem for) nondeterministic finite automata (NFA), relying solely on the NFA's resources (states and transitions). Unlike the standard NFA simulation, involving an algorithm which stores at each step all the states reached nondeterministically while reading the input, we consider deterministic finite automata (DFA) with lookahead, which choose the "right" NFA transitions based on a fixed number of input symbols read ahead. This concept, known as lookahead delegation, arose in a formal study of web services composition and its subsequent practical applications. Here we answer several related questions, such as "when is lookahead delegation possible?" and "how hard is it to find a delegator with a given lookahead buffer size?". In particular, we show that only finite languages have the property that all of their NFA's have delegators. This implies, among others, that delegation is a machine property, rather than a language property. We also prove that the existence of lookahead delegators for unambiguous NFA is decidable, thus partially solving an open problem. Finally, we show that finding delegators (even for a given buffer size) is hard in general, and is efficient for unambiguous NFA, and we give an algorithm and a compact characterization for NFA delegation in general.  
Date  August, 2006  
Report  "Deterministic Simulation of a NFA with ksymbol Lookahead" (PDF) 
CS200629  

Title  Query Processing and Optimization in Native XML Databases  
Author  Ning Zhang  
Abstract 
XML has emerged as a semantic markup language for documents as well as the de facto language for data exchange over the World Wide Web. Declarative query languages, such as XPath and XQuery, are proposed for querying over large volumes of XML data. A number of techniques have been proposed to evaluate XML queries more efficiently. Many of these techniques assume a tree model of XML documents and are, therefore, also applicable to other data sources that can be explicitly or implicitly translated into a similar data model. The focus of this thesis is on efficient evaluation and optimization of path expressions in native XML databases. Specifically, the following issues are considered: storage system design, design of physical operators and efficient execution algorithms, and the costbased query optimizer. The proposed storage system linearizes the tree structure into strings that can be decomposed into disk pages. Simple statistics are kept in the page headers to facilitate I/Oefficient navigation. Based on this storage system, a hybrid approach is developed to evaluate path expressions that exploit the advantages of navigational and joinbased approaches. Path expressions that only contain "local" axes (child, parent, attribute, self, followingsibling, and precedingsibling) are evaluated by means of the proposed "NextofKin" (NoK) operator. A general path expression that contains both local axes and global ones (ancestor, descendant, ancestororself, descendantorself, following, and preceding) is decomposed into NoK subtrees whose intermediate results are structurally joined to produce the final result. Experiments show that the navigational operator can be an order of magnitude faster than joinbased approaches in some cases, but slower in others. Thus a costbased query optimizer is necessary to choose the optimal execution plan based on estimates of the cost of each operator. The cost of an operator heavily depends on the cost model and its input. The inputs to the cost model are usually the cardinalities of path expressions. In this thesis, a synopsis structure called XSEED is proposed to estimate the cardinality of a path expression. An XSEED synopsis can be constructed by compressing an XML document to a small kernel first, and then more information can be added to the synopsis. XSEED results in more accurate cardinality estimation than previous approaches and is easier to construct, easier to update, and can incorporate query feedback. Efficient query execution exploits indexes. The last component of the thesis is a featurebased structural index, called FIX, to expedite tree matching on a large tree. This is based on the observation that the navigational operation is expensive and applying it to every tree node is very inefficient. FIX extracts distinctive features from subtrees and uses them as the index keys. Similarly, for each incoming query, the features of the query tree are extracted and used as search keys to retrieve the candidate results. Thus, it is sufficient to match only on these candidate results. The experimental results show that the pruning power of FIX is very high  more than 99% for structurerich data sets, and more than 20% for data sets with less structural variety.  
Date  August, 2006  
Report  Query Processing and Optimization in Native XML Databases (PDF) 
CS200630  

Title  "On the definition of stochastic lambdatransducers"  
Authors  Stavros Konstantinidis and Nicolae Santean  
Abstract 
We
propose
a
formal
definition
for
the
general
notion
of
stochastic
transducer,
called
stochastic
lambdatransducer.
Our
definition
is
designed
with
two
objectives
in
mind:
(i)
to
extend
naturally
the
established
notion
of
stochastic
automaton
with
output

as
defined
in
the
classic
books
of
Paz
(1971)
and
Starke
(1972)

by
permitting
pairs
of
inputoutput
words
of
different
lengths;
(ii)
to
be
compatible
with
the
more
general
notion
of
weighted
transducer
so
that
one
can
apply
tools
of
weighted
transducers
to
address
certain
computational
problems
involving
stochastic
transducers.
The
new
transducers
can
be
used
to
model
stochastic
inputoutput
processes
that
cannot
be
modeled
using
classic
stochastic
automata
with
output.
Keywords: Probabilistic transducer, Probabilistic automaton, Stochastic transducer, Stochastic automaton, Stochastic transduction, Weighted transducer, Transducer, Automaton  
Date  September 5, 2006  
Report  "On the definition of stochastic lambdatransducers" (PDF) 
CS200631  

Title  Generalized Labeled LCA Queries  
Authors  Jeremy Barbay, Ehsan Chiniforooshan, Alexander Golynski, JuiYi Kao, Aleh Veraskouski  
Abstract  Schemafree queries permit to search XML documents without knowing their schema. Among them, lowest common ancestor (LCA) queries were introduced in several variants on labeled trees. We define threshold LCA queries to generalize all those variants, and to extend them to the case where weights are assigned to each term of the query. We study how to solve those in the context where access to the document is streamed, and in the context where the document is accessed through a precomputed index. We propose spaceefficient algorithms for both contexts, using space O(h+s) independant from the size of the document, where h is the height of the input document and s is the number of different labels. We describe two distinct algorithms in the streamed model, both of which read the input stream exactly once and run in linear time, and one algorithm in the indexed model, which provably runs in sublinear time for many instances, characterized by a difficulty measure.  
Date  Revised July 2007  
Report  Generalized Labeled LCA Queries (PS)  Generalized Labeled LCA Queries (PDF) 
CS200632  

Title  Graph Isomorphism Completeness for Perfect Graphs and Subclasses of Perfect Graphs  
Authors  C. Boucher and D. Loker  
Abstract  A problem is said to be GIcomplete if it is provably as hard as graph isomorphism; that is, there is a polynomialtime Turing reduction from the graph isomorphism problem. It is known that the GI problem is GIcomplete for some special graph classes including regular graphs, bipartite graphs, chordal graphs and split graphs. In this paper, we prove that deciding isomorphism of double split graphs, the class of graphs exhibiting a 2join, and the class of graphs exhibiting a balanced skew partition are GIcomplete. Further, we show that the GI problem for the larger class including these graph classesthat is, the class of perfect graphsis also GIcomplete.  
Date  September 6, 2006  
Report  Graph Isomorphism Completeness for Perfect Graphs and Subclasses of Perfect Graphs (PDF) 
CS200633  

Title  Expected Approximation guarantees for the Demand Matching Problem  
Author  C. Boucher, D. Loker  
Abstract  The objective of the demand matching problem is to obtain the subset $M$ of edges which is feasible and where the sum of the profits of each of the edges is maximized. The set $M$ is feasible if for each vertex $v$ the total demand of edges in $M$ incident to $v$ is at most $b_v$. In the case where each of the edges has one unit profit, the problem becomes finding the subset of largest size and hence, is called the cardinality problem. Shepherd and Vetta [SV06] demonstrate that the integrality gap for the general demand matching problem for bipartite graphs is between $2.5$ and $2.764$, and between $3$ and $3.264$ nonbipartite graphs. We demonstrate that an expected $2.5$approximation guarantee and $3$approximation guarantee is achieveable for bipartite graphs and nonbipartite graphs and give some connections to the independent set and weighted independent set problem.  
Date  September 18, 2006  
Report  Expected Approximation guarantees for the Demand Matching Problem (PDF) 
CS200634  

Title  Metro: An Analysis Toolkit for Template Semantics  
Authors 
Joanne
M.
Atlee,
Nancy
A.
Day, Jianwei Niu, Eunsuk Kang, Yun Lu, David Fung, Leonard Wong  
Abstract  We describe the Metro toolkit, which supports software modelling and analysis for requirements notations that have configurable semantics. Metro is based on a formalism, called template semantics, which structures the operational semantics of a family of notations as a predefined parameterized template that is instantiated with userprovided parameter values. Thus, the semantics of a single notation can be expressed succinctly as a set of parameter values to this template. The Metro toolkit takes as input a specification and a set of templateparameter values, and it produces an analyzable model. The toolkit can either translate the specification into the input language of an existing model checker (e.g., SMV), or compile the specification into a more primitive form (e.g., logic, BDDs) that is suitable for analysis. MagicDraw is used as a frontend for editing specifications and animating SMVgenerated counterexamples.  
Date  September 22 , 2006  
Report  Metro: An Analysis Toolkit for Template Semantics (PDF) 
CS200635  

Title  Formal Verification of the A7E Software Requirements using Template Semantics  
Authors  Eunsuk Kang and Nancy A. Day  
Abstract  Template semantics is a templatebased approach to ease the process of identifying the essential differences among modelbased notations. In this approach, a template captures semantics that are common among notations and allows users to specify only the distinctive features of a notation. In this paper, we illustrate the method of describing requirements in Software Cost Reduction (SCR) using the Metro toolkit, which is the framework for the modelling and analysis of notations in template semantics. Furthermore, we demonstrate the usage of Metro to verify the A7E software requirements and compare our verification effort to an alternative method of requirements analysis, which does not use template semantics.  
Date  September 22 , 2006  
Report  Formal Verification of the A7E Software Requirements using Template Semantics (PDF) 
CS200636  

Title  Improved Bounds for the Online Steiner Tree Problem in Graphs of Bounded EdgeAsymmetry  
Author  Spyros Angelopoulos  
Abstract  In this paper we consider the Online Steiner Tree problem in weighted directed graphs of bounded edgeasymmetry $\alpha$. The edgeasymmetry of a directed graph is defined as the maximum ratio of the cost (weight) of antiparallel edges in the graph. The problem has applications in multicast routing over a network with nonsymmetric links. We improve the previously known upper and lower bounds on the competitive ratio of any deterministic algorithm due to Faloutsos et al. In particular, we show that a better analysis of a simple greedy algorithm yields a competitive ratio of $O(\min \{k, \frac{\alpha \log k} {\log \log \alpha}\})$, where $k$ denotes the number of terminals requested. On the negative side, we show a lower bound of $\Omega(\min \{k^{1\epsilon}, \frac{\alpha \log k}{\log \log k} \})$ on the competitive ratio of every deterministic algorithm for the problem, for any arbitrarily small constant $\epsilon$.  
Date  October 2, 2006  
Report  Improved Bounds for the Online Steiner Tree Problem in Graphs of Bounded EdgeAsymmetry (PDF) 
CS200637  

Title  Optimal Superblock Instruction Scheduling for MultipleIssue Processors using Constraint Programming  
Authors  Abid M. Malik, Tyrel Russell, Michael Chase, and Peter van Beek  
Abstract  Modern processors have multiple pipelined functional units and can issue more than one instruction per clock cycle. This puts great pressure on the instruction scheduling phase in a compiler to expose maximum instruction level parallelism. Instruction level parallelism (ILP) at the local level using basic blocks is limited. Compilers increase ILP by doing instruction scheduling at the global level using larger regions, which are created by combining basic blocks. Superblocks are one of the most commonly used scheduling regions for global instruction scheduling. Superblock scheduling is NPcomplete, and is done suboptimally in production compilers using heuristic approaches. In this work, we present a constraint programming approach to the superblock instruction scheduling problem that is both optimal and fast enough to be incorporated into production compilers. We experimentally evaluated our optimal scheduler on the SPEC2000 integer and floating point benchmarks. On this benchmark suite, the optimal scheduler was very robust and scaled to the largest superblocks. Depending on the architectural model, between 99.991% to 99.999% of all superblocks were solved to optimality. The scheduler was able to routinely solve the largest superblocks, including superblocks with up to 2600 instructions. This compares favourably to the recent best work by Shobaki and Wilken on optimal superblock scheduling using dynamic programming and enumeration.  
Date  October 17, 2006  
Report  Optimal Superblock Instruction Scheduling for MultipleIssue Processors using Constraint Programming (PDF) 
CS200638  

Title  Study of the Meiotic Recombination Hotspot Diffusion Paradox  
Authors  Jérémy Barbay  
Abstract  Zoologists and evolutionists ponder about an apparent paradox in the current model of how sexual reproduction happens, basing their conclusions on extensive simulations. We show through a mathematical analysis that the results of those simulations can be predicted for a larger class of models, and we deduce from this analysis one of the key features of the model which yield the paradox. Based on this analysis, we define another mathematical model which solves this paradox, and we check our results through simulation.  
Date  October 10, 2006  
Report  Study of the Meiotic Recombination Hotspot Diffusion Paradox (PS)  Study of the Meiotic Recombination Hotspot Diffusion Paradox (PDF) 
CS200639  

Title  Caspian: A QoSAware Deployment Approach for Channelbased Componentbased Applications  
Authors  Abbas Heydarnoori  
Abstract  With significant advances in software development technologies in recent years, it is now possible to have complex software applications that include a large number of heterogeneous software components distributed over a large network of computers with different computational capabilities. To run such applications, their components must be instantiated on proper hardware resources in their target environments so that requirements and constraints are met. This process is called software deployment. For large, distributed, componentbased applications with many constraints and requirements, it is difficult to do the deployment process manually and automated tools and techniques are required. This report presents a graphbased approach for this purpose that is not dependent on any specific component technology and does the deployment planning with respect to the communication resources, i.e. channels, required by application components and communication resources available on the hosts in the target environment. In our approach, componentbased applications and distributed environments are modeled with the help of graphs. Deployment of an application is then defined as the mapping of the application graph to the target environment graph.  
Date  November 6, 2006  
Report  Caspian: A QoSAware Deployment Approach for Channelbased Componentbased Applications (PDF) 
CS200640  

Title  Unaligned Binary Codes for Index Compression in SchemaIndependent Text Retrieval Systems  
Authors  Stefan Buettcher and Charles L. A. Clarke  
Abstract 
We
examine
index
compression
techniques
for
schemaindependent
inverted
files
used
in
text
retrieval
systems.
Schemaindependent
inverted
files
contain
full
positional
information
for
all
index
terms
and
allow
the
structural
unit
of
retrieval
to
be
specified
dynamically
at
query
time,
rather
than
statically
during
index
construction.
Schemaindependent
indices
have
different
characteristics
than
documentoriented
indices,
and
this
difference
can
affect
the
effectiveness
of
index
compression
algorithms
greatly.
Our experimental results show that unaligned binary codes that take into account the special properties of schemaindependent indices achieve better compression rates than methods designed for compressing document indices and that they can reduce the size of the index by around 15% compared to bytealigned index compression. Moreover, we present a number of performanceenhancing techniques that may be used to very efficiently decode unaligned codes. Thus, their more compact index representation does not carry the cost of a substantially decreased query processing performance. This contradicts most earlier results on the decoding performance of unaligned codes.  
Date  October 11, 2006  
Report  Unaligned Binary Codes for Index Compression in SchemaIndependent Text Retrieval Systems (PDF) 
CS200641  

Title  BSim: A System for ThreeDimensional Visualization of Brownian Motion  
Authors  Tenn Francis Chen and Gladimir V. G. Baranoski  
Abstract  Brownian motion is considered to be of fundamental importance for theoretical and applied scientific research. To date, the computational renderings of this phenomenon available in the scientific literature have been limited to twodimensional case studies. This paper describes an interactive computer graphics system for the threedimensional visualization of this phenomenon. The visual simulations are based on the formulas provided in Einstein's seminal paper on this topic, which are carefully revisited to introduce the phenomenon's underlying physics. The system's predictive capability, which is a key attribute for educational and scientific applications, is demonstrated by the qualitative agreement between simulation results and actual Brownian motion observations.  
Date  November 22, 2006 (Revised April 29, 2009)  
Report  BSim: A System for ThreeDimensional Visualization of Brownian Motion (PDF) 
CS200642  

Title  SecondTier Cache Management Using Write Hints  
Authors  Xuhui Li, Ashraf Aboulnaga, Kenneth Salem, Aamer Sachedina, Shaobo Gao  
Abstract  Storage servers, as well as storage clients, typically have large memories in which they cache data blocks. This creates a twotier cache hierarchy in which the presence of a firsttier cache (at the storage client) makes it more difficult to manage the secondtier cache (at the storage server). Many techniques have been proposed for improving the management of secondtier caches, but none of these techniques use the information that is provided by {\em writes} of data blocks from the first tier to help manage the secondtier cache. In this paper, we illustrate how the information contained in writes from the first tier can be used to improve the performance of the secondtier cache. In particular, we argue that there are different reasons why storage clients write data blocks to storage servers (e.g., cleaning dirty blocks vs. limiting the time to recover from failure). These different types of writes can provide strong indications about the current state and future access patterns of a firsttier cache, which can help in managing the secondtier cache. We propose that storage clients inform the storage servers about the types of writes that they perform by passing {\em write hints}. These write hints can then be used by the server to manage the secondtier cache. We focus on the common and important case in which the storage client is a database system running a transactional (OLTP) workload. We describe, for this case, the different types of write hints that can be passed to the storage server, and we present several cache management policies that rely on these write hints. We demonstrate using trace driven simulations that these simple and inexpensive write hints can significantly improve the performance of the secondtier cache.  
Date  January 18, 2007  
Report  SecondTier Cache Management Using Write Hints (PDF) 
CS200643  

Title  Semantic Prefetching of Correlated Query Sequences  
Authors  Ivan T. Bowman, Kenneth Salem  
Abstract  We present a system that optimizes sequences of related client requests by combining small requests into larger ones, thus reducing perrequest overhead. The system predicts upcoming requests and their parameter values based on past observations, and prefetches results that are expected to be needed. We describe how the system makes its predictions and how it uses them to optimize the request stream. We also characterize the benefits with several experiments.  
Date  November 21, 2006  
Report  Semantic Prefetching of Correlated Query Sequences (PDF) 
CS200644  

Title  Correcting Sample Selection Bias by Unlabeled Data  
Authors  Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, Bernhard Sholkopf  
Abstract  We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by marching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.  
Date  January 18, 2007  
Report  Correcting Sample Selection Bias by Unlabeled Data (PDF) 
CS200645  

Title  Skyline and Topk Processing in Web Bargaining  
Authors  Mohamed A. Soliman, Ihab F. Ilyas, Nick Koudas  
Abstract  Skyline and topk queries are gaining increasing importance in many emerging applications. Current skyline and topk query processing techniques work on deterministic object attributes and known scores. However, in many practical scenarios these settings are inapplicable. In this paper we focus on web interaction scenarios where each interaction is a data object with a set of possible outcomes (scores). Obtaining exact interaction scores is expensive as it involves complex arbitration between interacting parties. Moreover, the outcome of each interaction might redefine other interactions scores. We demonstrate that the search space for solving such problems is very large. Based on this we formulate and present skyline and topk processing algorithms that can efficiently reduce the search space. We present the results of a thorough experimental evaluation quantifying the relative performance of the algorithms we propose herein with respect to costly exact solutions. Our results indicate that our techniques can efficiently reduce the space and identify precise solutions.  
Date  November 23, 2006  
Report  Skyline and Topk Processing in Web Bargaining (PDF) 
CS200646  

Title  List Update with Locality of Reference: MTF Outperforms All Other Algorithms  
Authors  Spyros Angelopoulos, Reza Dorrigiv, and Alejandro LopezOrtiz  
Abstract 
It
has
been
observed
that
in
practice,
typical
request
sequences
for
the
list
update
problem
exhibit
a
certain
degree
of
locality
of
reference.
We
first
extend
the
locality
of
reference
model
for
the
paging
problem
due
to
Albers
et
al~[STOC
2002/JCSS
2005]
to
the
domain
of
list
update;
this
addresses
the
open
question
of
defining
an
appropriate
model
for
capturing
locality
of
reference
in
the
context
of
list
update
[Hester
and
Hirschberg
ACM
Comp.
Surv.
1985].
We
then
apply
this
model
in
conjunction
with
a
recent
technique
for
comparing
online
algorithms,
namely
bijective
analysis~[SODA
2007]
and
analyze
well
known
online
algorithms
for
list
update.
Using this framework, we prove that MovetoFront (MTF) is the unique optimal algorithm for list update. This holds for both the standard cost function of Sleator and Tarjan~[C. ACM 1985] and the refined cost function proposed independently by Mart\'{\i}nez and Roura~[TCS 2000] and Munro~[ESA 2000]. Our work resolves an open conjecture of Mart\'{\i}nez and Roura, namely proposing a measure which can successfully separate MTF from all other algorithms.  
Date  November 22, 2006  
Report  List Update with Locality of Reference: MTF Outperforms All Other Algorithms (PDF) 
CS200647  

Title  MAXSM: A MultiHeuristic Approach to XML Schema Matching  
Authors  Mirza Beg, Laurent Charlin, Joel So  
Abstract  In this paper, we propose an automatic schema matching approach called MAXSM, designed specifically for matching schemas in the context of enterprise data integration. By focusing on enterprise data integration scenarios, it becomes viable to maintain a repository of established schema mappings and reuse that repository to enhance the accuracy of mapping candidates produced by MAXSM. As a consequence of focusing on the enterprise integration space, we further focus MAXSM on XML schema matching, specifically, we consider XML Schema Definition (XSD) schemas. MAXSM is a multiheuristic schema matching approach which employs, amongst other heuristics, a novel heuristic using WordNet for determining naturallanguage semantic similarity between node labels. MAXSM defines and describes how transitive mappings can be discovered from known mappings and used to seed production of candidate mappings. We present a cogent tree spanning approach to traverse and search the two schemas more effectively. While this paper does not document an implementation of MAXSM, we do discuss a highlevel implementation architecture as well as design of experiments to verify its efficacy.  
Date  December 11, 2006  
Report  MAXSM: A MultiHeuristic Approach to XML Schema Matching (PDF) 
CS200648  

Title  Succinct DataStructures for Labeled Graphs  
Authors  Jérémy Barbay  
Abstract 
Succinct
datastructures
support
efficient
search
queries
while using space asymptotically optimal. Such datastructures are known for structures such as binary strings, ordinal and cardinal trees, graphs, binary relations, labeled and multilabeled trees. We consider graphs where each node is associated to an arbitrary number of labels. We show that some categories of such graphs have a succinct encoding which supports efficiently labelbased navigation and search operators.  
Date  December 11, 2006  
Report  Succinct DataStructures for Labeled Graphs (PS)  Succinct DataStructures for Labeled Graphs (PDF) 