2005 Technical Reports | SCS | UW
CS-2005-01 |
Title |
Index-Trees for Descendant Tree Queries on XML documents |
Authors |
Jérémy Barbay |
Abstract |
We propose an index structure and an algorithm to answer Descendant Tree queries, the subset of XPath
queries limited to the descendant axis, in computational models where elements can be accessed in non-sequential order. The indexing scheme uses classical labeling techniques, but structurally represents the ancestor-descendant relationships between nodes of same type, in order to allow faster searches. The algorithm solves XPath location path queries along the descendant axis in a holistic way, looking at the whole query rather than just concentrating on individual components, as opposed to “set-at-a-time” strategies. Using an adaptive analysis, we prove tight bounds on the worst case complexity in the comparison model, and a reasonable upper bound on the worst case complexity in the hierarchical memory model, which accounts for the number of I/O transfers at any fixed level. Keywords: XPath location paths; comparison model; holistic; cache-oblivious; adaptive.
|
Date |
January 2005
|
Comments |
|
Report |
Latex |
PostScript |
Adobe PDF |
|
CS-2005-02 |
Title |
Stereo Rendering: An Overview |
Authors |
Jamie Brettle and Gladimir Baranoski |
Abstract |
Humans visually perceive the world using a variety of depth
cues.
Stereopsis is an example of such a depth cue, and it is directly
associated with the human binocular visual system. Different computer
graphics approaches can be used to take advantage of stereopsis in
order to enable a true 3D perception of virtual environments. In this
report, we briefly review
computer graphics hardware and software tools designed for this purpose
as
well as optimization techniques that can be applied to speed up the
synthesis of stereo images.
|
Date |
December 2005
|
Comments |
|
Report |
|
|
Adobe PDF |
|
CS-2005-03 |
Title |
A new formulation for haplotype inference by pure parsimony |
Authors |
Daniel G. Brown, Ian M. Harrower |
Abstract |
|
Date |
January 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-04 |
Title |
Building an Embedded Control Program Workload |
Authors |
Heng Yu and Grant Weddell |
Abstract |
A compiled database application has a repository-style architecture in which a collection of modules operates on a common main-memory database through a set of predefined transaction types. To conduct simulated experiments evaluating the performance of locking techniques for compiled database applications, we need a workload that approximates this main-memory data processing. We choose the open-source MINIX operating system as our case study, and select a set of system calls as predefined transaction types. We map the kernel data structures to relational tables, and the C source code to C code with embedded SQL statements (C/SQL code). The C/SQL code is then transformed into finite state machines for the transaction types, and the costs of states and transitions are extracted from the original C code in terms of numbers of machine instructions. This yields the complete workload. We also discuss how to model the costs of locking/unlocking operations in our workload.
|
Date |
February 2005
|
Comments |
|
Report |
Latex |
PostScript |
Adobe PDF |
|
CS-2005-05 |
Title |
Empirical Analysis on the Geriatric Care Data Set Using Rough Sets Theory |
Authors |
Jiye Li and Nick Cercone |
Abstract |
A well known problem for association rules generation is that too many rules are generated, and it is difficult to determine manually which rules are more useful, interesting and important. In our study of using rough sets theory to improve the utility of association rules, we propose a new rule importance measure to select the most appropriate rules. In order to explore the application of the proposed method to large data sets, we perform the experiments on a Geriatric Care data set. |
Date |
February 2005
|
Comments |
|
Report |
Latex |
PostScript |
Adobe PDF |
|
CS-2005-06 |
Title |
Query-Based Approach to Workflow Process Dependency Analysis |
Authors |
Weizhen Dai and H. Dominic Covvey |
Abstract |
Dependency analysis is important in all of the stages
of workflow processes. An appropriate analysis will help us to identify
the potentially affected entities if changes occur. In this technical report
we thoroughly analyze the critical entities of workflow processes and their
dependency relationships, and propose a multi-dimensional dependency model
that includes routing dependency, data dependency and role dependency.
We further build a knowledge base to represent the multi-dimensional dependency
model using the logic programming language Prolog. We then develop
a set of query rules that can be applied to the well-defined knowledge
base at both activity and process levels to retrieve the potentially affected
entities. Finally we use a case study of workflow processes in the healthcare
domain to show how our dependency analysis methodology works.
|
Date |
October 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-09 |
Title |
The complexity of Domino Tiling |
Authors |
Therese Biedl |
Abstract |
In this paper, we study the problem of how to tile a layout with dominoes. For non-coloured dominoes, this can be determined easily by testing whether the layout graph has a perfect matching. We study here tiling with
coloured dominoes, where colours of adjacent dominoes must match. It was known that this problem is NP-hard when the layout graph is a tree. We first strengthen this NP-hardness result in two ways: (1) we can use a
path instead of a tree, or (2) we can require that exactly the given set of dominoes is used. However, both these reductions (as well as the existing one) use an unbounded number of colours, which is not realistic for domino tiling. As our main interest, we hence study domino tiling with a constant number of colours. We show that this is NP-hard even with 3 colours. |
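The uncoloured case mentioned above reduces to perfect matching: two-colour the cells like a checkerboard, and a tiling is exactly a matching that pairs every cell with one neighbour. A minimal sketch of that test (our own illustrative code and naming, not the report's reduction machinery):

```python
def can_tile(cells):
    """Check whether a set of unit cells (row, col) can be tiled by
    uncoloured dominoes, by looking for a perfect matching in the
    bipartite graph whose sides are the two checkerboard colour classes."""
    cells = set(cells)
    black = [c for c in cells if (c[0] + c[1]) % 2 == 0]
    if 2 * len(black) != len(cells):
        return False  # colour classes unbalanced: no perfect matching
    match = {}  # white cell -> black cell currently matched to it

    def augment(b, seen):
        # Kuhn's augmenting-path step from black cell b.
        for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0)):
            w = (b[0] + dr, b[1] + dc)
            if w in cells and w not in seen:
                seen.add(w)
                if w not in match or augment(match[w], seen):
                    match[w] = b
                    return True
        return False

    return all(augment(b, set()) for b in black)
```

A perfect matching pairs each cell with exactly one neighbour, which is precisely a domino tiling; it is the colour-matching constraints studied in the report that make the problem NP-hard.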
Date |
February 2005
|
Comments |
|
Report |
Latex |
PostScript |
Adobe PDF |
|
CS-2005-10 |
Title |
Indexing the Results of Sliding Window Queries |
Authors |
Lukasz Golab, Piyush Prahladka and M. Tamer Ozsu |
Abstract |
A popular method of bounding the memory requirements of queries over data streams is to use a sliding window, where old data are continuously removed as new data arrive. One problem that has not been addressed previously concerns indexing the results of sliding window queries. This is a noteworthy problem because data stream systems often materialize common sub-expressions or final results of similar queries; therefore, it is important to allow the queries efficient access to a relevant subset of the materialized result. In this paper, we design and evaluate indexing methods that take advantage of the temporal order in which old answers expire for efficient maintenance and querying of the results. Our techniques allow sharing of indexed results among similar queries, adapt to fluctuating stream arrival rates, and are experimentally shown to perform updates over twice as fast as the existing sliding window indices. |
Date |
May 2005
|
Comments |
|
Report |
Latex |
PostScript |
Adobe PDF |
|
CS-2005-11 |
Title |
RDL: A Language for Framework
Instantiation Representation |
Authors |
Toacy C. Oliveira, Paulo S. C. Alencar, Carlos J.P. de Lucena,
Donald D. Cowan. |
Abstract |
Reusing software artifacts for system development is showing
increasing promise as an approach to reducing the time and effort involved
in building new systems, and to improving the software development process
and the quality of its outcome. However, software reuse has an associated
steep learning curve, since practitioners must become familiar with a third
party rationale for representing and implementing reusable assets. For this
reason, enabling a systematic approach to the reuse process by making software
reuse tasks explicit, allowing software frameworks to be instantiated using
pre- defined primitive and complex reuse operations, and supporting the
reuse process in a (semi-)automated way become crucial goals. In this paper,
we introduce a systematic reuse approach and Reuse Description Language
(RDL), a language designed to specify object-oriented framework instantiation
processes, and an RDL execution environment, which is the tool support for
definition and execution of reuse processes and framework instantiations
that lead to domain-specific applications. We illustrate our approach using
JUnit, a framework for testing.
Note: Toacy C. Oliveira is with the Faculty of Informatics at the Pontifical
Catholic University of Rio Grande do Sul, Brazil (toacy@inf.pucrs.br). Paulo
S. C. Alencar and Donald D. Cowan are with the School of Computer Science at
the University of Waterloo, Ontario, Canada ({palencar, dcowan}@csg.uwaterloo.ca).
Carlos J. P. de Lucena is with the Computer Science Department at the
Pontifical Catholic University of Rio de Janeiro, Brazil (lucena@inf.puc-rio.br). |
Date |
May 2005
|
Comments |
|
Report |
|
|
Adobe PDF |
|
CS-2005-12 |
Title |
xTAO: Enabling a Declarative Approach to the Specification
of Multi-Agent Systems |
Authors |
Toacy C. Oliveira, Paulo Alencar, Don Cowan, Carlos Lucena |
Abstract |
Research on software agents has produced a diversity of
conceptual models for high-level abstract descriptions of multi-agent systems
(MASs). However, it is still difficult and costly for designers who need
a unique set of agent modeling features to either develop a new agent modeling
language from scratch or undertake the task of modifying an existing language.
In addition to the modeling itself, in both cases a significant effort
needs to be expended in building or adapting tools to support the language.
An extensible agent modeling language is crucial to experimenting with
and building tools for novel modeling constructs that arise from evolving
research. Existing approaches typically support a basic set of modeling
constructs very well, but adapt to others poorly. A declarative language
such as XML and its supporting tools provide an ideal platform upon which
to develop an extensible modeling language for multi-agent systems. In
this paper we describe xTAO, an extensible agent modeling language, and
also demonstrate its value in the context of a real-world application.
|
Date |
2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-13 |
Title |
Complexity of Octagonal and Rectangular Cartograms |
Authors |
T. Biedl and B. Genc |
Abstract |
In this paper, we study the complexity of cartograms, i.e.,
maps where every region should be deformed such that a given area
requirement is satisfied. We consider the case where every region
is an orthogonal octagon, and show that this problem is NP-hard.
From our proof, it also follows that cartograms with rectangular faces
are NP-hard if we allow for the existence of a "sea", i.e.,
a region of arbitrarily high complexity on the outside of the drawing.
|
Date |
December 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-14 |
Title |
Shortest Path Trees Computation in Dynamic Graphs |
Authors |
Edward P. F. Chan & Yaya Yang |
Abstract |
Let G = (V, E, w) be a simple digraph in which all edge weights are
non-negative real numbers. Let G′ be obtained from G by the application of a
set of edge weight updates to G. Let s ∈ V, and let Ts and T′s be a Shortest
Path Tree (SPT) rooted at s in G and G′, respectively. The Dynamic Shortest
Path (DSP) problem is to compute T′s from Ts. For the DSP problem, we correct
and extend a few existing SPT algorithms to handle multiple edge weight
updates. We prove that these extended algorithms are correct, and we analyze
their complexity. To evaluate the proposed algorithms, we compare them with
the well-known static Dijkstra algorithm. Extensive experiments are
conducted with both real-life and artificial data sets. The real-life data
are road system graphs obtained from the Connecticut road system and are
relatively sparse. The artificial data are randomly generated graphs and are
relatively dense. The experimental results suggest the most appropriate
algorithms to use under different circumstances.
|
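The static baseline the authors compare against recomputes the tree from scratch after every change; a minimal Dijkstra sketch of that baseline (our own illustrative code with an assumed name, not the report's extended dynamic algorithms):

```python
import heapq

def sssp_tree(graph, s):
    """Compute a shortest path tree rooted at s with Dijkstra's algorithm.
    graph: {u: [(v, w), ...]} with non-negative edge weights w.
    Returns (dist, parent); the parent pointers encode the SPT."""
    dist, parent = {s: 0.0}, {s: None}
    pq = [(0.0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v], parent[v] = nd, u
                heapq.heappush(pq, (nd, v))
    return dist, parent
```

A dynamic algorithm instead repairs (dist, parent) only in the subtree affected by each batch of weight updates, which is where the correctness subtleties addressed in the report arise.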
Date |
March 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-15 |
Title |
Optimization and Evaluation of Shortest Path Queries |
Authors |
Edward P. F. Chan & Heechul Lim |
Abstract |
We investigate the problem of how to efficiently evaluate a collection of
shortest path queries on massive graphs that are too big to fit in main
memory. To evaluate a shortest path query efficiently, we introduce two
pruning algorithms. These algorithms differ in the extent to which they
materialize shortest path costs and in how they prune the search space. By
grouping shortest path queries properly, batch processing improves the
performance of shortest path query evaluation. We also study extensively
the fragment sizes, cache sizes and query types that we show affect the
performance of a disk-based shortest path algorithm. The performance and
scalability of the proposed techniques are evaluated with large road systems
in the Eastern United States. To demonstrate that the proposed disk-based
algorithms are viable, we show that their search times are significantly
better than those of main-memory Dijkstra's algorithm.
|
Date |
September 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-16 |
Title |
URL-Enhanced Adaptive Page-Refresh Models |
Authors |
Robert Warren, Dana Wilkinson, Alejandro López-Ortiz |
Abstract |
We study the refresh model required to keep an up-to-date copy of a web
page. This has applications to more efficient and frequent crawling, in the
case of search engines, and to higher hit rates for proxy servers with
pre-fetching. Two main models have been proposed, namely the uniform refresh
model (Cho and Garcia-Molina, 2000) and the adaptive page refresh model
(Edwards et al., 2001), with some debate as to the relative value of each
model. In this work we show that adaptive page refresh models can be further
improved by careful selection of the initial page-refresh rates of newly
added links, as indicated by our page evolution studies showing that
page-change rates (and consequently page-refresh rates) depend on top-level
domain, category and page depth.
|
Date |
May 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-17 |
Title |
Vocabulary size and email authentication |
Authors |
Robert Warren |
Abstract |
This paper explores the performance of the method proposed
by Efron and
Thisted to predict vocabulary sizes based on sampled text. The objective
of
this research is to determine whether this simple and quick test can
be used
as a coarse indicator of authorship. Three sets of emails, as well as other
texts, are analyzed in order to collect performance data. The conclusion is
that the test is at best a lower-bound indicator within the $T < 1.0$ region
and is not sufficient as an authentication method.
|
Date |
July 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-18 |
Title |
Dynamic Histograms for Non-Stationary Updates |
Authors |
Elizabeth Lam and Kenneth Salem |
Abstract |
In this paper, we address the problem of incrementally maintaining a histogram in response to a non-stationary update process. In relational database systems, this problem can occur whenever relations model time-varying activities. We present a simple update model that is general enough to describe both stationary and non-stationary update processes, and we use it to show that existing histogram maintenance techniques can perform poorly when updates are non-stationary. We describe several techniques for solving this problem, and we use the update model to demonstrate that these techniques can effectively handle a broad range of update processes, including non-stationary ones.
|
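As a concrete point of reference for what "incrementally maintaining a histogram" means, here is a minimal fixed-bucket sketch (the class name, boundaries, and the crude range estimator are our own illustrative choices, not the techniques proposed in the report):

```python
class EquiWidthHistogram:
    """Bucket counts over [lo, hi) maintained incrementally as the
    underlying column is updated; a stationarity-agnostic baseline."""
    def __init__(self, lo, hi, nbuckets):
        self.lo, self.width = lo, (hi - lo) / nbuckets
        self.counts = [0] * nbuckets

    def _bucket(self, v):
        return min(int((v - self.lo) / self.width), len(self.counts) - 1)

    def insert(self, v):
        self.counts[self._bucket(v)] += 1

    def delete(self, v):
        self.counts[self._bucket(v)] -= 1

    def estimate_range(self, a, b):
        # Crude selectivity estimate: sum every touched bucket in full.
        return sum(self.counts[self._bucket(a):self._bucket(b) + 1])
```

Under a non-stationary update process the value distribution drifts, so fixed boundaries like these degrade; the techniques in the report are aimed at exactly that failure mode.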
Date |
June 2005
|
Comments |
|
Report |
Latex |
PostScript |
Adobe PDF |
|
CS-2005-19 |
Title |
Optimal Basic Block Instruction Scheduling for Multiple-Issue
Processors using Constraint Programming |
Authors |
Abid M. Malik, Jim McInnes and Peter van Beek |
Abstract |
Instruction scheduling is one of the most important steps
for improving the performance of object code produced by a
compiler. A fundamental problem that arises in instruction scheduling
is to find a minimum length schedule for a basic block---a
straight-line sequence of code with a single entry point and
a single exit point---subject to precedence, latency, and
resource constraints. Solving the problem exactly is
NP-complete, and
heuristic approaches are currently used
in most compilers.
In contrast, we present a scheduler that finds provably {\em optimal}
schedules for basic blocks using techniques from constraint programming.
In developing our optimal scheduler,
the keys to scaling up to large, real
problems were improvements to the constraint model and to
the constraint propagation phases.
We experimentally evaluated
our optimal scheduler on the
SPEC 2000 integer and floating point benchmarks.
On this benchmark suite, the optimal scheduler was very robust and scaled to
the largest basic blocks: all but a handful of the hundreds of thousands of
basic blocks in our benchmark suite were solved optimally within a
reasonable time limit, and the scheduler routinely solved the largest basic
blocks, including basic blocks with up to 2600 instructions. This compares
favorably to the best previous exact approaches.
|
Date |
November 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-20 |
Title |
An Evaluation of Metro Express |
Authors |
Shahram Esmaeilsabzali, Leonard Wong, and Nancy A. Day |
Abstract |
We study Express, a system that uses template semantics to
map
specifications to SMV models. We investigate the efficiency of the generated
SMV models. We consider two case studies and compare manually created
SMV
models with models generated by Express. The generated models are more
complex and have larger state spaces, and consequently longer verification
times. We also analyze the effect of using different optimization schemes
in
Cadence SMV. Interestingly, some of the optimizations are more applicable to
the generated models and made verification of the generated models
outperform that of the manually created models for some queries. However,
considering the best case for each model, the manually created model always
outperforms the generated model. We propose some optimizations that could be
used to dramatically improve the efficiency of the models generated by
Express, and we estimate the magnitude of the improvements that such
optimizations can provide.
|
Date |
December 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-21 |
Title |
An XML Routing Synopsis in Unstructured P2P Networks |
Authors |
Qiang Wang, Abhay Kumar Jha, and Tamer Ozsu |
Abstract |
Many emerging applications that use XML are distributed, usually over large peer-to-peer (P2P) networks on the Internet. The deployment of an XML query shipping system over P2P networks requires a specialized synopsis to capture XML data in routing tables. In this paper, we propose a novel graph-structured routing synopsis, called kd-synopsis, for deployment over unstructured super-peer-based P2P networks. This synopsis is based on a length-constrained FB-simulation relationship, which allows the balancing of the precision and size of the synopsis according to different space constraints on peers with heterogeneous capacity. We report comprehensive experiments to demonstrate the effectiveness of the kd-synopsis. |
Date |
August 2006 |
Comments |
|
Report |
Latex |
PostScript |
Adobe PDF |
|
CS-2005-22 |
Title |
XSEED: Accurate and Fast Cardinality Estimation for XPath
Queries |
Authors |
Ning Zhang, M. Tamer Ozsu, Ashraf Aboulnaga, and Ihab F.
Ilyas |
Abstract |
Cardinality estimation is a crucial part of a cost-based
optimizer. Many research efforts have been focused on XML synopsis structures
of path queries for cardinality estimation in recent years. In ideal situations,
a synopsis should provide accurate estimates for different types of queries
over a wide variety of data sets, consume
a small amount of memory while being able to adjust as memory budgets
change, and be easy to construct and update. None of the existing synopsis
proposals satisfy all of the above requirements. In this paper, we propose
a novel synopsis, XSEED, that is accurate, robust, efficient, and adaptive
to memory budgets. We construct an XSEED structure starting from a very
small kernel, then incrementally update information of the synopsis. With
such an incremental construction, a synopsis structure can be dynamically
configured to accommodate different memory budgets. It can also handle updates
to underlying XML documents, and be self-tuning by incorporating query feedback.
Cardinality estimation based on XSEED can be performed very efficiently
and accurately, with our techniques of small footprint and novel recursion
handling. Extensive experiments on both synthetic and real data sets are
conducted, and our results show that even with less memory, the accuracy of
XSEED can be an order of magnitude better than that of other synopsis
structures. The cardinality estimation time is under 2% of the actual
querying time for a wide range of queries in all test cases.
|
Date |
June 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-23 |
Title |
Proceedings of the Doctoral Consortium at the 13th IEEE International
Requirements Engineering Conference |
Authors |
Edited by Nancy A. Day, University of Waterloo |
Abstract |
|
Date |
August 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-24 |
Title |
Multi-column matching for database schema translation |
Authors |
Robert H. Warren, Dr. Frank Wm. Tompa |
Abstract |
We describe a generalised method for discovering complex
schema matches involving multiple database columns. The method does not
require linked data and is capable of dealing with both fixed- and variable-length
field columns.
This is done through an iterative algorithm which learns the correct
sequence of concatenations of column substrings in order to translate from
one database to another. We introduce the algorithm along with examples
on common database data values and examine its performance on real-world
and synthetic data sets.
|
Date |
August 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-25 |
Title |
Assisting Framework Instantiation: Enhancements to Process-Language-based Approaches
|
Authors |
Marcillio Mendonca, Paulo Alencar, Toacy Oliveira, Donald Cowan
|
Abstract |
Application frameworks have been successfully used as valuable
tools to improve software quality while reducing development efforts.
Nevertheless, frameworks still face important challenges in order to be
widely adopted. In particular, framework instantiation is still a
painful task requiring application developers to understand the
intricate details surrounding the framework design. Some approaches to
alleviate this problem have already been proposed in the literature but
they are usually either just a textual cross-referenced document of the
instantiation activities or too tied to technology or specific
application domains. In this paper, we present the results of our latest
investigations into improving our approach to framework instantiation. In
particular, we discuss a process language we have developed to guide
framework instantiation explicitly, and the most recent updates we have
made to improve the language expressiveness. Furthermore, we present a
case study used to evaluate our approach and to identify current and
future extensions.
|
Date |
August 2005
|
Comments |
|
Report |
|
|
Adobe PDF |
|
CS-2005-26 |
Title |
Interface Automata with Complex Actions - Extended Version |
Authors |
Shahram Esmaeilsabzali, Nancy A. Day, and Farhad Mavaddat |
Abstract |
Many formalisms use interleaving to model concurrency. To
describe some
system behaviours appropriately, we need to limit interleaving. For example,
in component-based systems, we wish to limit interleaving to force the
inputs to a method to arrive together in order. In Web services, the
arrival
of XML messages consisting of multiple simple parts should not be
interleaved with the behaviour of another component. We introduce "interface
automata with complex actions" (IACA), which add complex actions to de
Alfaro and Henzinger's interface automata (IA). A complex action is a
sequence of actions that may not be interleaved with actions from other
components. The composition and refinement operations are more involved
in
IACA compared to IA, and we must sacrifice associativity of composition.
However, we argue that the advantages of having complex actions make
it a
useful formalism. We provide proofs of various properties of IACA and
discuss the use of IACA for modelling Web services.
|
Date |
2005, Revised May 15, 2006
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-27 |
Title |
Toward an understanding of haplotype inference by pure parsimony |
Authors |
Daniel G. Brown, Ian M. Harrower |
Abstract |
|
Date |
January 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-28 |
Title |
On Concurrency Control in Sliding Window Queries over Data Streams |
Authors |
Lukasz Golab, Kumar Gaurav Bijay, M. Tamer
Özsu |
Abstract |
Data stream systems execute a dynamic workload of long-running
and one-time queries, with the streaming inputs typically bounded by
sliding windows. For efficiency, windows may be advanced periodically
by replacing the oldest part of the window with a batch of newly arrived
data. Existing work on stream processing assumes that a window
cannot be advanced while it is being accessed by a query. In this paper,
we argue that concurrent processing of queries (reads) and window-slides
(writes) is required by data stream systems in order to allow prioritized
query scheduling and improve the freshness of answers. We prove that
the traditional notion of conflict serializability
is insufficient in this context and define stronger isolation levels
that restrict the allowed serialization orders. We also design and experimentally
evaluate a transaction scheduler that efficiently enforces the new isolation
levels by taking advantage of the access patterns of sliding window queries.
|
Date |
December 2005
|
Comments |
|
Report |
|
|
Adobe PDF |
|
CS-2005-29 |
Title |
InterJoin: Exploiting Materialized Views in XML Query Processing |
Authors |
Derek Phillips, Ning Zhang, Ihab F. Ilyas, and M. Tamer Özsu |
Abstract |
Efficient processing of XPath expressions is an integral
part of XML data query processing. Exploiting materialized views in query
processing can significantly enhance query processing performance. We propose
a novel view definition that allows for intermediate (structural) join results
to be stored and reused in XML query evaluation. Unlike current XML view
proposals, our views do not require navigation in the original document
or path-based pattern matching. Hence, they can be evaluated significantly
faster and can be more easily costed as part of a query plan.
In general, current structural joins cannot exploit views efficiently
when the view definition is not a prefix (or a suffix) of the XPath
query. To increase the applicability of our proposed view definition,
we propose a novel physical structural join operator called InterJoin.
The InterJoin operator allows for joining interleaving XPath expressions,
e.g., joining //A//C with //B to evaluate //A//B//C. InterJoin treats
structural joins as a logical operator, giving more join alternatives in
the XML query plan.
We propose several physical implementations for InterJoin, including
a technique to exploit spatial indexes on the inputs. We give analytic
cost models for the implementations so they can be costed in an existing
join-based XML query optimizer. Experiments on real and synthetic XML
data show significant speed-ups of up to 200% using InterJoin, as well
as speed-ups of up to 400% using our materialized views.
|
Date |
October 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-31 |
Title |
Indexing Time vs. Query Time Trade-offs in Dynamic Information Retrieval Systems |
Authors |
Stefan B�ttcher and Charles L. A. Clarke |
Abstract |
We examine issues in the design of fully dynamic information
retrieval systems with support for instantaneous document insertions and
deletions. We present one such system and discuss some of the major design
decisions. These decisions affect both the indexing and the query processing
efficiency of our system and thus represent genuine trade-offs between indexing
and query processing performance.
Two aspects of the retrieval system -- fast, incremental updates and
garbage collection for delayed document deletions -- are discussed in detail,
with a focus on the respective trade-offs. Depending on the relative number
of queries and update operations, different strategies lead to optimal
overall performance. Special attention is given to a particular case of
dynamic search systems -- desktop and file system search. As one of the
main results of this paper, we demonstrate how security mechanisms necessary
for multi-user support can be extended to realize efficient document deletions.
|
Date |
October 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-32 |
Title |
Memory Management Strategies for Single-Pass Index Construction
in Text Retrieval Systems |
Authors |
Stefan Büttcher and Charles L. A. Clarke |
Abstract |
Many text retrieval systems construct their index by accumulating
postings in main memory until there is no more memory available and then
creating an on-disk index from the in-memory data. When the entire text
collection has been read, all on-disk indices are combined into one big
index through a multiway merge process.
This paper discusses several ways to arrange postings in memory and studies
the effects on memory requirements and indexing performance. Starting from
the traditional approach that holds all postings for one term in a linked
list, we examine strategies for combining consecutive postings into posting
groups and arranging these groups in a linked list in order to reduce the
number of pointers in the linked list. We then extend these techniques
to compressed posting lists and finally evaluate the effects they have
on overall indexing performance for both static and dynamic text collections.
Substantial improvements are achieved over the initial approach.
|
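The accumulate-then-merge pipeline described above can be sketched as follows (a toy illustration under assumed names; in-memory "runs" stand in for the on-disk indices, and the posting-group and compression strategies the report studies are omitted):

```python
import heapq
from collections import defaultdict

def index_collection(docs, memory_limit):
    """docs: iterable of (doc_id, text). Flush the in-memory index to a
    sorted run whenever it holds memory_limit postings, then combine all
    runs into the final index with a multiway merge."""
    runs, inmem, held = [], defaultdict(list), 0
    for doc_id, text in docs:
        for pos, term in enumerate(text.split()):
            inmem[term].append((doc_id, pos))
            held += 1
        if held >= memory_limit:
            runs.append(sorted((t, p) for t, ps in inmem.items() for p in ps))
            inmem.clear()
            held = 0
    if inmem:
        runs.append(sorted((t, p) for t, ps in inmem.items() for p in ps))
    merged = defaultdict(list)  # final index: term -> posting list
    for term, posting in heapq.merge(*runs):
        merged[term].append(posting)
    return dict(merged)
```

The memory-arrangement question the report studies lives inside `inmem`: whether each term's postings sit in a plain linked list, grouped posting blocks, or compressed blocks changes both the memory ceiling before a flush and the cost of building each run.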
Date |
October 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-33 |
Title |
Adding variability management to UML-based software product
lines |
Authors |
Edson Alves de Oliveira Junior, Itana M. S. Gimenes, Elisa H. M. Huzita
(Departamento de Informática, Universidade Estadual de Maringá, Brazil);
José C. Maldonado (Departamento de Ciências de Computação e Estatística,
Universidade de São Paulo, São Carlos, Brazil); Paulo Alencar (Computer
System Group, School of Computer Science, University of Waterloo, Canada) |
Abstract |
The software product line (PL) approach promotes the generation
of specific products from a set of core assets for a given domain. This
approach is applicable to domains in which products have well-defined commonalities
and variation points. Variability management is concerned with the management
of the differences between products throughout the PL lifecycle. This paper
presents a UML-based process for variability management that allows identification,
representation and delimitation of variabilities; and, identification of
mechanisms for variability implementation. The evaluation of the process
was carried out as a case study within the context of an existing PL for
the Workflow Management System (WfMS) domain. The results have shown that
the proposed process includes well-defined control mechanisms that increase
the possibility of identifying and tracing variabilities.
|
Date |
October 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-34 |
Title |
Inferring a Serialization Order for Distributed Transactions |
Authors |
Khuzaima Daudjee and Kenneth Salem |
Abstract |
Data partitioning is often used to scale up a database system. In a
centralized database system, the serialization order of committed update
transactions can be inferred from the database log. To achieve this in a
shared-nothing distributed database, the serialization order of update
transactions must be inferred from multiple database logs. We describe a
technique to generate a single stream of updates from the logs of multiple
database systems. This single stream represents a valid serialization
order of update transactions at the sites over which the database is
partitioned.
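The report's actual inference technique is not reproduced here, but the flavor of producing one stream from several logs can be sketched under a strong simplifying assumption: each site's log already lists its committed update transactions in local serialization order, tagged with globally comparable commit timestamps. Under that assumption, a k-way merge of the sorted logs yields a single stream consistent with every site's local order.

```python
import heapq

def merge_logs(*logs):
    """Merge per-site logs into one update stream.

    Each log is an iterable of (commit_ts, txn_id) pairs, already sorted by
    commit_ts; heapq.merge interleaves them without re-sorting.
    """
    return [txn for _, txn in heapq.merge(*logs)]

site_a = [(1, "T1"), (4, "T3"), (6, "T5")]
site_b = [(2, "T2"), (5, "T4")]
print(merge_logs(site_a, site_b))  # ['T1', 'T2', 'T3', 'T4', 'T5']
```

Note that globally comparable timestamps are exactly what a shared-nothing system may lack, which is why inferring the order from the logs alone is the harder problem the report addresses.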
|
Date |
February 2007
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-35 |
Title |
Adaptive Searching in Succinctly Encoded Binary Relations
and Tree-Structured Documents |
Authors |
Jérémy Barbay, Alexander Golynski, J. Ian Munro,
and S. Srinivasa Rao |
Abstract |
This paper deals with succinct representations
of data types motivated by applications in posting lists for search engines,
in querying XML documents, and in the more general setting (which extends
XML) of multi-labeled trees, where several labels can be assigned to each
node of a tree. To find the set of references corresponding to a set of
keywords, one typically intersects the list of references associated with
each keyword. We view this instead as having a single list of objects [n]
= {1, . . . , n} (the references), each of which has a subset of the labels
[σ] = {1, . . . , σ} (the keywords) associated with it. We are able to find
the objects associated with an arbitrary set of keywords in time O(δk lg
lg σ) using a data structure requiring only t(lg σ+o(lg σ)) bits, where
δ is the number of steps required by a non-deterministic algorithm to check
the answer, k is the number of keywords in the query, σ is the size of the
set from which the keywords are chosen, and t is the number of associations
between references and keywords. The data structure is succinct in that
it differs from the space needed to write down all t occurrences of keywords
by only a lower order term. An XML document is, for our purpose, a labeled
rooted tree. We deal primarily with “non-recursive labeled trees”, where
no label occurs more than once on any root to leaf path. We find the set
of nodes whose path from the root includes a given set of keywords in the same
time, O(δk lg lg σ), on a representation of the tree using essentially minimum
space, 2n + n(lg σ +o(lg σ)) bits, where n is the number of nodes in the
tree. If we permit nodes to have multiple labels, this space bound becomes
2n + t(lg σ + o(lg σ)) bits, that is the information theoretic lower bound
for an ordinal tree (a node can have an arbitrary number of children ordered
from left to right) plus that for the multiple labeling, where t is the
total number of labels assigned. In proving these results, we consider two
data structures of independent interest: we give an encoding for σ by n
boolean matrices, using optimal space and supporting in time O(lg lg σ)
the operators access (the value at the intersection of a row and a column),
rank (how many matches occur in this row to the left of this entry, or how
many are in this column and above), and select (find the r-th match in this
row, or in this column); and we give an encoding for labeled trees of n
nodes and σ labels, using optimal space and supporting in time O(lg lg σ)
the label-based operator labeltree desc(a, x), which finds the first descendant
of x labeled a. Keywords: succinct data-structures, labeled trees, multi-labeled
trees, conjunctive queries, intersection problem, opt-threshold set.
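As a plain-Python stand-in for the interface described above (access, rank, and select over a boolean matrix, plus the conjunctive keyword query), the following naive sketch illustrates only the operator semantics. The paper's succinct structure answers these queries in O(lg lg σ) time within nearly optimal space; this sketch makes no attempt to match either bound, and all names in it are invented for illustration.

```python
class BooleanRelation:
    """Naive sigma-by-n boolean matrix: rows are labels, columns are objects."""

    def __init__(self, pairs, sigma, n):
        self.n = n
        self.matrix = [[False] * n for _ in range(sigma)]
        for label, obj in pairs:
            self.matrix[label][obj] = True

    def access(self, label, obj):
        """Value at the intersection of row `label` and column `obj`."""
        return self.matrix[label][obj]

    def row_rank(self, label, obj):
        """How many matches occur in this row strictly left of column `obj`."""
        return sum(self.matrix[label][:obj])

    def row_select(self, label, r):
        """Column of the r-th match (1-based) in row `label`, or None."""
        seen = 0
        for col, bit in enumerate(self.matrix[label]):
            seen += bit
            if seen == r:
                return col
        return None


def conjunctive(rel, labels):
    """Objects associated with every keyword in `labels` (the intersection)."""
    return [obj for obj in range(rel.n)
            if all(rel.access(lab, obj) for lab in labels)]


# Objects 0..3, labels 0..1: object 2 carries both labels.
rel = BooleanRelation([(0, 0), (0, 2), (1, 2), (1, 3)], sigma=2, n=4)
print(conjunctive(rel, [0, 1]))  # [2]
```

Viewing keyword search this way, intersecting posting lists is exactly a conjunctive query over the rows of this relation.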
|
Date |
November 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-36 |
Title |
Viscoelastic Registration of Medical Images |
Authors |
Zhao Yi and Justin Wan |
Abstract |
Since the physical behavior of many tissues has been shown to be
viscoelastic, we propose a novel registration technique for medical
images via physical principles of viscoelasticity. It allows the
template and the target images to differ by large local deformations
like viscous fluid and small local deformations like elastic solid.
With an extra elastic term introduced, the proposed model constrains
the transformation by coupled viscoelastic behavior and therefore
reduces the smearing effect caused by the fluid terms. We apply our
approach to both artificial test data and MR brain slices yielding
accurate registration results.
|
Date |
December 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
CS-2005-37 |
Title |
Image Registration via Particle Movement |
Authors |
Zhao Yi and Justin Wan |
Abstract |
Though the fluid model offers a good approach to nonrigid registration
with large deformations, it suffers from the blurring artifacts introduced
by the viscosity term. To overcome this drawback, we
present an inviscid model expressed in a particle framework. Our
idea is to simulate the template image as a set of particles moving
towards the target positions. The proposed model can accommodate
both small and large deformations with sharp edges and reasonable
displacement maps achieved. Because of its simplicity, the
computational cost is much lower than that of other nonrigid registration
approaches.
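The particle idea can be caricatured in a few lines. This sketch simply moves each particle a fixed fraction of the way toward its target position per iteration; it ignores the paper's actual force model and image-matching terms entirely, and the parameter names are invented for the example.

```python
def move_particles(positions, targets, step=0.25, iters=20):
    """Relax 2D particles toward their targets by a fixed fraction per step.

    positions: list of [x, y] particle coordinates (e.g. template pixels).
    targets:   list of (x, y) destinations of the same length.
    Inviscid in spirit: each particle moves independently, so no viscosity
    term smears the resulting displacement field.
    """
    pos = [list(p) for p in positions]
    for _ in range(iters):
        for p, t in zip(pos, targets):
            p[0] += step * (t[0] - p[0])
            p[1] += step * (t[1] - p[1])
    return pos


# A single particle converging toward the target (10, 5).
print(move_particles([[0.0, 0.0]], [(10.0, 5.0)]))
```

Since the update is a contraction (each step shrinks the remaining gap by a factor of 1 − step), the particles converge geometrically to their targets, accommodating both small and large displacements.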
|
Date |
December 2005
|
Comments |
|
Report |
Latex |
|
Adobe PDF |
|
|