Weekly Schedule
The following is the weekly schedule for the course. We will complete the
study of "classical" distributed database systems in six weeks. The remainder
of the time will be devoted to discussing more recent topics and your projects.
In the weekly schedule, I have indicated the material that you need
to read. Whenever there is a reference to the textbook, I will be
lecturing. The papers will be presented by students. Each presentation
will be 30 minutes, followed by 15 minutes of discussion (sometimes we may
extend these times), and each presenter will be required to submit, at the
time of the presentation, a 5-7 page critique of the paper. This will count
towards one of the two paper critiques. Requirements for the paper critiques
will follow.
Each of these papers is available electronically through the TRELLIS
system of the University of Waterloo Library. Whenever a paper is easier
to obtain elsewhere or has not yet been published, I have placed a link
to it from the paper's title.
Week 1 - January 3, 2001
Introduction and architectural issues
- Chapters 1-4 from the textbook.
Week 2 - January 10, 2001
Data distribution/distributed query processing
- Sections 5.1-5.2 and Chapters 7 & 8 from the textbook.
Week 3 - January 17, 2001
Distributed query optimization
- Chapter 9 from the textbook.
- Chapter 3 in C. Yu and W. Meng, Principles of Query Processing for Advanced Database Applications, Morgan Kaufmann, 1997.
- G. Graefe, "Query Evaluation Techniques for Large Databases", ACM Computing Surveys, June 1993.
Presenter: Ivan Bowman
- D. Kossmann, "The State of the Art in Distributed Query Processing", to appear in ACM Computing Surveys. (PDF format)
Presenter: Sunny Lam
Week 4 - January 24, 2001
Multi-Database Query Processing
- Section 15.2 of the textbook.
- Chapter 4 in C. Yu and W. Meng, Principles of Query Processing for Advanced Database Applications, Morgan Kaufmann, 1997.
- L.M. Haas, D. Kossmann, E. Wimmers, and J. Yang, "Optimizing queries across diverse data sources", Proc. Int'l Conf. on VLDB, 1997. (PDF format)
Presenter: Hui Zhang
- A. Tomasic, L. Raschid, and P. Valduriez, "Scaling Access to Heterogeneous Data Sources with DISCO", IEEE Trans. on Knowledge and Data Eng., 10(5): 808-823, 1998.
Presenter: Yan Wang
- M.T. Roth and P. Schwarz, "Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources", Proc. Int'l. Conf. on VLDB, 1997. (PDF format)
Presenter: Lubomir Petrov Stanchev
Week 5 - January 31, 2001
Transaction Processing and Concurrency Control
- Chapters 10 & 11 of the textbook.
- D. Georgakopoulos, M. Hornick, and A. Sheth, "An Overview of Workflow Management: From Process Modeling to Workflow Automation Infrastructure", Distributed and Parallel Databases, 3: 119-153, 1995.
Presenter: Yuxin Cao
- G. Weikum, "Principles and Realization Strategies of Multilevel Transaction Management", ACM Trans. on Database Systems, 16(1): 132-180, 1991.
Presenter: Ning Zhang
Week 6 - February 7, 2001
Distributed Database Reliability
- Chapter 12 of the textbook.
- C. Mohan and I. Narang, "ARIES/CSA: A method for database recovery in client-server architectures", Proc. ACM SIGMOD Conference, 1994, pages 55-66. This requires knowledge of the following:
- C. Mohan et al., "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging", ACM Trans. on Database Systems, 17(1): 94-162, 1992.
Presenter: Meng He
Week 7 - February 14, 2001 - Survey Talks
- Hybrid Query Execution Models, Ivan Bowman (Presentation slides)
Abstract: The client-server relational execution model has proven to be a very
effective architecture for partitioning the functionality of distributed
data-intensive applications. This model partitions execution costs into a
procedural portion, executed at the client, which sends queries to a relational
portion executed at the server. This partitioning does not take full advantage
of available resources: client resources cannot be used to perform relational
processing, and server resources cannot be used to execute procedural client
code. Practitioners have used several ad hoc mechanisms to distribute the
execution cost of client-server systems. Stored procedures, user-defined
functions, and advanced data types allow application code to execute on the
server. Further, some applications compute relational expressions such as
filters, joins, and grouping operations on the client in custom application
code. While these mechanisms can significantly improve application performance,
they are cumbersome to use and require partitioning choices to be made
statically, early in the development process.
In this talk, I will discuss extending the client-server model to allow an
optimizer to select the execution site for both relational and procedural code.
In this hybrid model, clients can perform local processing of relational
operators, and servers can execute fragments of procedural code on behalf of
the client application. Such a model requires the following (a minimal sketch
of the optimizer's choice follows the list):
- a query processing engine on the client, either accessing a local cache
  (with associated consistency issues) or retrieving data from the server;
- a mechanism to execute procedural code on the server (possibly requiring
  state information from the client), and
- an optimizer that can choose a reasonable execution plan considering these
  alternatives.
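To make the last requirement concrete, here is a minimal sketch (in Python) of
how an optimizer might choose where to evaluate a filter predicate in such a
hybrid model. The cost model, the per-row cost figures, and the function name
are illustrative assumptions, not taken from the talk; the example assumes the
server is more heavily loaded than the client, so its per-row CPU cost is higher.

    # Illustrative sketch: choosing the execution site for a filter predicate
    # in a hybrid client/server model. All costs and names are assumptions.

    def choose_filter_site(rows_on_server, selectivity,
                           server_cpu_cost_per_row=2.0,   # server is shared, so costlier per row
                           client_cpu_cost_per_row=1.0,   # idle client CPU is cheap
                           transfer_cost_per_row=5.0):    # shipping one row over the network
        """Return 'server' or 'client', whichever has the lower estimated cost."""
        # Evaluate on the server: pay server CPU for every row, then ship only
        # the qualifying rows to the client.
        cost_server = (rows_on_server * server_cpu_cost_per_row +
                       rows_on_server * selectivity * transfer_cost_per_row)
        # Evaluate on the client: ship every row, then pay client CPU for each.
        cost_client = (rows_on_server * transfer_cost_per_row +
                       rows_on_server * client_cpu_cost_per_row)
        return 'server' if cost_server <= cost_client else 'client'

    print(choose_filter_site(100000, selectivity=0.01))  # 'server': few rows survive, ship those
    print(choose_filter_site(100000, selectivity=0.99))  # 'client': almost everything ships anyway

With these assumed costs, a highly selective filter is pushed to the server,
while a non-selective one is evaluated on the less loaded client.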
- Data Synchronization in a Distributed Database Environment, Lubomir Stanchev (Presentation slides)
Abstract: In a distributed database, the stored data is often related. For
example, the results of frequently executed queries may be stored at one or
more sites in order to reduce query response time. We will refer to such
stored query results as materialized views, and to the data on which the
queries are posed as the underlying data. The main concern in such a model is
synchronizing the related data. For example, when the underlying data is
updated, we would like the updates to be propagated to the materialized views;
this problem is called view maintenance. We might also want to allow a
restricted type of materialized view update, and when such updates occur, the
underlying data may have to be updated accordingly.
An important problem is imposing constraints on which data updates are allowed
and to what extent the data should be synchronized. Possible constraints on
data updates may include that materialized views may not be updated directly,
or may be updated only if certain predicates on the data hold. We may also
require that local and global integrity constraints on the data hold, and
allow only updates that preserve those integrity constraints. Examples of
synchronization constraints include specifying which data items should always
be up to date, and which may lag behind the most up-to-date data, but by no
more than a specified amount of time.
In this survey talk we will explore relevant research in the areas of view
maintenance, view update, and data integration. Time permitting, we will
discuss how the existing theory applies to the problem of data synchronization
and how we can exploit integrity constraints and auxiliary data to improve the
performance of existing algorithms.
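As a concrete illustration of view maintenance, here is a minimal sketch that
keeps a materialized aggregate view synchronized by propagating deltas from
updates to the underlying data, instead of recomputing the view from scratch.
The table, attribute, and function names are made up for illustration and are
not taken from the talk.

    # Illustrative sketch of incremental view maintenance: the materialized view
    # totals_by_product (SUM(amount) grouped by product) is kept in sync with the
    # underlying sales data by applying each update's delta directly to the view.

    from collections import defaultdict

    sales = []                              # underlying data: (product, amount) pairs
    totals_by_product = defaultdict(float)  # materialized view

    def insert_sale(product, amount):
        """Update the underlying data and propagate the delta to the view."""
        sales.append((product, amount))
        totals_by_product[product] += amount    # incremental maintenance

    def delete_sale(product, amount):
        sales.remove((product, amount))
        totals_by_product[product] -= amount    # propagate the negative delta

    insert_sale("widget", 10.0)
    insert_sale("widget", 5.0)
    insert_sale("gadget", 7.5)
    delete_sale("widget", 10.0)
    print(dict(totals_by_product))              # {'widget': 5.0, 'gadget': 7.5}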
February 21, 2001 (Study week, no class)
Week 8 - February 28, 2001 - Survey Talks
- The Overview of Web Search Engines, Sunny Lam (Presentation slides)
Abstract: The World Wide Web allows people to share information globally. The amount of information grows without bound. In order to extract the information we are interested in, we need a tool to search the Web; such a tool is called a search engine. This survey covers the different components of a search engine and how a search engine really works. It provides a background understanding of information retrieval, discusses different search engines that are commercially available, and investigates how search engines find information on the Web and how they rank pages for a given query. The paper also provides guidelines for users on how to use search engines.
- Consistency Control Algorithms for Web Caching, Leon Cao (Presentation slides)
Abstract: The World Wide Web is increasing exponentially in size, which leads
to rapidly increasing traffic on the Internet. Thus, reducing the volume of
network traffic produced by web clients and servers and improving response
time for users have become critical issues. The continued growth of the World
Wide Web has increased both network load and server load. Hence there is a
growing concern for effectively managing the bandwidth demands of users and
reducing the latency experienced by clients. Over the years, web caching has
become an increasingly important topic. The use of web caches has become a
cheap and effective way to improve performance for all Internet users. A web
cache sits between web servers and one or more clients, watches requests for
HTML pages, images, and files (generally, objects) go by, and saves a copy of
each for itself. Then, if there is another request for the same object, it
uses the copy it has instead of asking the original server for it again.
There are two main reasons that web caches are used:
- To reduce latency - Because the request is satisfied from the cache (which
  is closer to the client) instead of the original server, it takes less time
  for the client to get the object and display it. This makes web sites seem
  more responsive.
- To reduce traffic - Because each object is only retrieved from the server
  once, it reduces the amount of bandwidth used by a client. This saves money
  if the client is paying by traffic, and keeps their bandwidth requirements
  lower and more manageable.
However, web caching technology still has a lot of open issues. One of them is
that many web caches do not satisfactorily keep cached contents consistent
with web servers. How to ensure consistency between the contents in the cache
and those on the actual web server, how to check whether a cached page is
still fresh, and when it should be checked and refreshed if necessary: these
questions lead to the topic of cache consistency. Cache consistency protocols
for client/server database systems have been the subject of much study in
recent years, and at least a dozen different algorithms have been proposed and
studied in this area.
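As a concrete illustration of the simplest of these approaches, here is a
minimal sketch of time-to-live (TTL) expiration: a cached object is served
while it is considered fresh, and the cache goes back to the origin server once
the TTL lapses. The class, parameter, and callback names are illustrative
assumptions, not taken from the talk.

    # Illustrative sketch of TTL-based cache consistency: serve a cached copy
    # while it is fresh; refetch from the origin server once it has expired.

    import time

    class TTLCache:
        def __init__(self, ttl_seconds, fetch_from_origin):
            self.ttl = ttl_seconds
            self.fetch = fetch_from_origin      # callback that contacts the origin server
            self.store = {}                     # url -> (object, time it was cached)

        def get(self, url):
            entry = self.store.get(url)
            if entry is not None:
                obj, cached_at = entry
                if time.time() - cached_at < self.ttl:
                    return obj                  # still fresh: serve from the cache
            # Stale or missing: refresh the copy from the origin server.
            obj = self.fetch(url)
            self.store[url] = (obj, time.time())
            return obj

    cache = TTLCache(ttl_seconds=60,
                     fetch_from_origin=lambda url: "<html>page at %s</html>" % url)
    print(cache.get("http://example.org/index.html"))   # miss: fetched from the origin
    print(cache.get("http://example.org/index.html"))   # hit: served from the cache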
Week 9 - March 7, 2001 - No class
Week 10 - March 14, 2001 - Survey Talks
- Web Mining and Knowledge Discovery of Usage Patterns, Yan Wang (Presentation slides)
Abstract: Web mining is a very active research topic combining two active
research areas: data mining and the World Wide Web. Web mining research
relates to, and is therefore attractive to, several research communities, such
as databases, information retrieval, and artificial intelligence. Although
there is some confusion about the web mining process, the most widely
recognized approach is to categorize web mining into three areas: web content
mining, web structure mining, and web usage mining. Web content mining focuses
on discovering and retrieving useful information from web contents, data, and
documents, while web structure mining emphasizes discovering how to model the
underlying link structure of the Web. There isn't a very clear distinction
between these two categories. Web usage mining is a relatively independent,
but not isolated, category, which mainly covers techniques for discovering
users' usage patterns and trying to predict users' behaviour.
My talk will focus on web usage mining, encompassing its three phases:
pre-processing (either mapping the web server's usage data into relational
tables before applying data mining technology, or using the usage logs
directly after applying pre-processing techniques), pattern discovery, and
pattern analysis. I will give a brief introduction to an example of a web
usage mining system, and talk generally about some current work in this
research area.
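As a concrete illustration of the pre-processing phase, here is a minimal
sketch that groups raw web-server log entries into user sessions using the
common heuristic of a 30-minute inactivity timeout. The log format and field
names are assumptions made for illustration, not taken from the talk.

    # Illustrative sketch of sessionizing a web-server log: requests from the same
    # client are grouped into one session until a 30-minute gap starts a new one.

    from datetime import datetime, timedelta

    SESSION_TIMEOUT = timedelta(minutes=30)

    def sessionize(log_entries):
        """log_entries: (client_ip, timestamp, url) tuples sorted by timestamp.
        Returns a list of sessions, each a list of URLs requested by one client."""
        sessions = []
        last_seen = {}   # client_ip -> (timestamp of last request, index of open session)
        for ip, ts, url in log_entries:
            prev = last_seen.get(ip)
            if prev is None or ts - prev[0] > SESSION_TIMEOUT:
                sessions.append([])              # start a new session for this client
                last_seen[ip] = (ts, len(sessions) - 1)
            idx = last_seen[ip][1]
            sessions[idx].append(url)
            last_seen[ip] = (ts, idx)
        return sessions

    log = [
        ("10.0.0.1", datetime(2001, 3, 14, 9, 0),  "/index.html"),
        ("10.0.0.1", datetime(2001, 3, 14, 9, 5),  "/courses.html"),
        ("10.0.0.2", datetime(2001, 3, 14, 9, 10), "/index.html"),
        ("10.0.0.1", datetime(2001, 3, 14, 10, 0), "/index.html"),  # > 30 min: new session
    ]
    print(sessionize(log))
    # [['/index.html', '/courses.html'], ['/index.html'], ['/index.html']]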
- New networking architectures and their impact on DDBMS, Ning Zhang (Presentation slides)
Abstract: As I/O-intensive software, DBMSs have focused on how to make
efficient use of the cache, main memory, and disk storage in the storage
hierarchy. In the past decade, Distributed Database Management Systems
(DDBMSs) have introduced the computer network as another access medium.
Traditionally, computer networks were placed at a lower layer than disk
storage in the hierarchy. With the advent of high-speed networks and
low-overhead protocols, computer networks can have higher bandwidth than hard
disks. How to reflect these fundamental changes in DDBMS, or even traditional
DBMS, architecture remains an open question. In this report, I shall survey
the current status of high-speed networks, especially gigabit/gigabyte
networks, and the low-overhead protocols and architectures Fibre Channel and
InfiniBand. Different approaches for incorporating high-speed networks into
operating systems are also introduced.
Week 11 - March 21, 2001 - Research Presentations
- Exploiting Networks in Distributed Sorting and Relational Operations, Ning Zhang
- Temporal Analysis in Usage Pattern Discovery, Yan Wang
- Lease-Augmented Cache Consistency Algorithm and its Performance Estimation, Leon Cao
Week 12 - March 28, 2001 - Research Presentations
- The MultiText Query and Answering System, Sunny Lam
- Automatically Partitioning Client-Server Applications, Ivan Bowman
- Proposal for a System for Semantic Data Control in Distributed Database Environment, Lubomir Stanchev