CS 748T
Topics in Databases - Distributed Database Management
Guidelines for Term Projects
Instructor: M. Tamer Özsu
Project Objectives and Scope
The projects consists of picking up a distributed database research problem
from among the ones listed below and working on its solution for the duration
of this term. I would be willing to entertain other topics that may be
of interest to you, but you should contact me immediately
What I expect in the project is a good understanding of the problem
(resulting in a survey part), insight into its solution and a well defined
strategy for its solution. You should treat the term project as if you
were doing the initial background study for further in-depth research.In
other words, the report should demonstrate an understanding of and an insight
into the problem such that given enough time, you could carry it to its
logical conclusion and complete the research.
The term project that you will prepare for this course will contribute
towards 50% of your final grade. Half of this will be for the survey part,
and the other half for the research part. These could be individual projects,
or group projects of two people. For group projects, both members of the
group will receive the same grade. Therefore, it is incumbent upon you
to make sure that both partners share in the work (and let me know very
quickly if the partnership is not working).
Deliverables
There are two deliverables of the term project:
-
A survey report that describes the problem domain, with proper problem
definition, and a survey of existing work. This should be about 25 typed
pages (12pt type with 1.5 spacing). This report, as indicated in the schedule
below, will be handed in at the end of the 6th week of classes. Each person
will then be responsible for presenting the field to class and leading
a discussion. This will start in Week 7 and each topic will get about 1.5
hours of class time.
-
A research report which will describe your own attempt to either solve
a problem in this domain or go a long way towards its solution. What I
minimally expect is a good solution approach such that if I gave you 2-3
more months, you could complete the solution, conduct the experiments and
produce a publishable paper. This report should be about 30 typed pages,including
a summary survey (i.e., a summary of part 1) of about 5 pages.
The first part is something that every individual (group) should do well;
the second part is going to vary with each individual's (group's) abilities.
It is possible that some groups may not do well on the second part; you
should, therefore, make sure that you do really well on the survey part
to get a decent mark.
Note that the report is very important. It should be written carefully
so that I can understand it easily. There will probably not be enough time
for me to spend an inordinate amount of effort trying to understand what
you mean. So include the necessary introductory material and make sure
that the presentation is good. All reports should be typewritten. Use a
word processor of your choice.
Schedule
The following is the schedule that we will follow.
-
January 3:
-
We decide if this is an individual or a joint project. If it is joint,
your first task is to find a partner.
-
January 17:
-
Find a problem that you want to work on. The list below gives some problems
that I am interested in and would like to cover in this course. If you
wish to pick up another topic, make an appointment and see me first.
-
January 31:
You should have finished reading the literature in the area by now.You'll
have ten days to write the survey report.
February 2:
I would like to receive a problem definition that is 1-3 typed pages.This
document should include a clear description of the problem on which you
are going to work in the area that you are writing your survey in.
-
February 9:
-
Your survey report should be in by 4PM.
-
April 11:
-
Absolute deadline for handing in final reports.
Research Topics
Here are a few research topics that you might wish to consider. Note that
the papers listed here are only to get you started. They are not meant
to be a complete list, and they many not even be the most important ones
from your perspective. My hope is that by the time you complete the project,
you'll have a good idea of what each area is about and what the more important
publications are. You should be looking at the proceedings of conferences
such as ACM SIGMOD, VLDB, ICDE, ICDCS, ... and at journals such as ACMTODS,
IEEE TKDE, VLDB Journal, Distributed and Parallel Databases Journal, Journal
of Intelligent Information Systems, World Wide Web, and many others.
Most of these publications can either be obtained through the University
of Waterloo Library's TRELLIS
System, or through ACM Digital Library,
or
through VLDB Endowment web page, or Michael
Ley's Computer
Science Bibliography. Michael's Bibliography is probably the best place
to start since it incorporates many of the papers. In a few cases where
I thought you might have difficulty finding the paper, I have included
a link.
Data Replication
Student:
Most distributed database systems are replicated. Replication has even
gained importance outside the database domain as a means of improving system
access performance and system reliability. There are some classical proposals,but
there are also newer ones that push the envelope significantly in how replicated
data are managed. The following is a combination of new and classic publications
for you to consider.
-
M. Buretta, Data Replication, Wiley, 1997.
-
S.B. Davidson, H. Garcia-Molina and D. Skeen, "Consistency in partitioned
networks", ACM Computing Surveys, 17(3): 341-370, 1985.
-
J. Gray, P. Helland, P. O"Neil, and D. Shasha, "The dangers of replication
and a solution", Proc. ACM SIGMOD Conference , 1996, pages 173-182.
-
B. Kemme and G. Alonso, "
Anew approach to developing and implementing eager database replication
protocols ", ACM Trans. Database Systems, to appear in September2000
issue.
-
B. Kemme and G. Alonso, "Don't be lazy, be consistent: Postgres-R, A new
way to implement database replication", Proc. VLDB Conference ,
2000,pages 134-143.
-
Y. Breitbart et al., "Update propagation protocols for replicated databases",Proc.ACM
SIGMOD Conference, 1999, pages 97-108.
-
E. Pacitti and E. Simon, "Update propagation strategies to improve freshness
in lazy master replicated databases", VLDB Journal , 8(3-4): 305-318,2000.
-
E. Pacitti, P. Minet and E. Simon, "Fast algorithms for maintaining replica
consistency in lazy master replicated algorithms", Proc. VLDB Conference,1999,
pages 126-137.
New networking architectures and their impact on DDBMS
Student: Ning Zhang
This research deals with the study of the effects of changing computer
network characteristics on the architecture and algorithms of DDBMSs. The
infrastructure on which DDBMSs are built is undergoing dramatic and rapid
changes. An important part of this infrastructure is the computer network
over which DDBMSs operate. A major change in this infrastructure is the
emergence of high-bandwidth, high-speed broadband networks coupled with
lightweight (low overhead) protocols. The characteristics of these networks
raise questions about the fundamental DDBMS assumption that the network
is the bottleneck. The question I want to investigate in this project is
exactly what these changes are, and what are the database architecture
components/protocols/... that need to be revisited.
-
A. Benner, Fibre Channel: Gigabit Communications and I/O for Computer
networks, McGraw Hill, 1995.
-
Papers from the Trapezeproject
at Duke University.
-
Infiniband Technology Specifications (I have these so contact me)
-
InfiniBand Architecture Specification, Volume 1 Release 1.0, Infiniband
Trade Association, October 2000.
-
InfiniBand Architecture Specification, Volume 2 Release 1.0, Infiniband
Trade Association, October 2000.
Web Data Management
Student:
This is a loaded topic.It means many things and it means different things
to different people. I have separated the topic into a number of segments.
Under this heading, I am more interested in modeling and querying Web data.
In addition to the Abiteboul, Bunemann, and Suciu's book Data on the
Web , the following references are relevant.
-
P. Atzeni, G. Mecca, and P. Merialdo, "To weave the Web", Proc. VLDB
Conference, 1997, pages 206-215.
-
D. Florescu, A. Levy, and A. Mendelzon, "Database techniques for the World
Wide Web: A survey", ACM SIGMOD Record, 27(3): 9-74, 1998.
-
G. Arocena and A. Mendelzon, "WebOQL: Restructuring documents, databases,
and Webs", Proc. ICDE, 1998.
-
A. Mendelzon and T. Milo, "Formal models of Web queries", Information
Systems, 28(6): 615-637, 1998.
-
A. Mendelzon, G. Mihaila, and T. Milo, "Querying the World Wide Web", Journal
of Digital Libraries, 1(1): 54-67, 1997.
-
P. Raghavan, "The Web as a graph", Proc. ACM PODS Conference, 2000.
-
S. Abiteboul, "Querying semi-structured data", Proc. 6th ICDT, 1997.
-
S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener, "The Lorel
query language for semistructured data", Journal of Digital Libraries,
1(1): 68-88, 1997.
-
P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu, "A query language
and optimization techniques for unstructured data", Proc. ACM SIGMOD
Conference, 1996, pages 505-516.
-
Y. Diao, H. Lu, S. Chen, and Z. Tian, "Toward learning based Web query
processing", Proc. VLDB Conference, 2000, pages 317-328.
-
S. Nesterov, S. Abiteboul, and R. Motwani, "Extracting schema from semistructured
data", Proc. ACM SIGMOD Conference, 1998.
Web Search Engines/Crawlers
Student: Sunny Lam
This topic focuses on an interesting aspect of the Web data management
that has significant overlaps with information retrieval. Therefore, I
suggest that you also look at information retrieval literature. The first
reference is a good one to start from.
-
Chapter 13 of R. Baeza-Yates and B. Riberio-Neto, Modern Information
Retrieval, Addison-Wesley, 1999.
-
S. Brin and L. Page, "
The anatomy of a large-scale hypertextual Web search engine", Proc.
7th WWW Conference, 1998.
-
A. Heydon and M. Najork, "
Mercator: A scalable, extensible Web crawler", World Wide Web,
2(4): 219-229, 1999.
-
L. Gravano and H. Garcia-Molina, "Generalizing GlOSS to vector-space databases
and broker hierarchies", Proc. VLDB Conference , 1995.
-
A. Howe and D. Dreilinger, "SavoySearch: A meta-search engine that learns
which engines to query", AI Magazine, 18(2), 1997.
-
S. Kirsch, "The future of Internet search: Infoseek's experiences searching
the Internet", ACM SIGIR Forum, 32(2): 3-7, 1998.
-
Special
issue of Data Engineering Bulletin, 23(3), September 2000.
-
L. Gravano, C-C.K. Chang, H. Garcia-Molina, and A. Paepcke, "STARTS: Stanford
proposal for Internet meta searching", Proc. ACM SIGMOD Conference,
1997, pages 207-218.
-
J. Kleinberg, "Authoritative sources in a hyperlinked environment", Proc.
ACM-SIAM Symp. on Discrete Algorithms, 1998, pages 668-677.
Web Mining
Student: Yan Wang
Web mining can focus on three major areas: Web usage mining, Web content
mining, and Web structure mining. The first paper below focuses on Web
usage mining. The second one is more general, and the third one deals with
exploiting links in the mining process. The other papers are not classified
this way.
-
J. Srivastava, R. Cooley, M.Deshpande, P.Tan, "
Web usage mining: discovery and applications of web usage pattern from
Web data", SIGKDD Explorations, 1(2), January 2000.
-
R. Kosala and H. Blockeel, "
Web Mining Research: A Survey", SIGKDD Explorations, 2(1),
June 2000.
-
S. Chakrabarti, "
Data Mining for Hypertext: A Tutorial Survey", SIGKDD Explorations
, 1(2), January 2000.
-
R. Cooley, B. Mobasher, J. Srivastava, "Web
Mining: Information and Pattern Discovery on the World Wide Web", on-line
publication.
Web Caching
Student: Leon Cao
This is a very wide topic. One can talk about static versus dynamic Web
caching algorithms, or one can focus on Web cache consistency algorithms,
or even proxy caching systems. The literature is so wide that it is hard
to give a representative sample. However, I have the following list for
now and will be adding to it soon. You can also find a ton of them if you
do a Web search.
-
Take a look at the bibliography http://www.research.att.com/~misha/tutorial/tutorial-bib.txt
-
M. Franklin, M. Carey and M. Livny, "Transactional Client-Server Cache
Consistency: Alternatives and Performance", ACM Transactions on Database
Systems, September, 1997.
-
G. Bang, F. Douglis, and M. Rabinovich, "Optimistic deltas for WWW latency
reduction", Proc. USENIX Technical Conference , 1997, pages 289-304.
-
F. Douglis, A. Haro, and M. Rabinovich, "HPP:HTML Macro-preprocessing to
support dynamic document caching", Proc. USENIX Symp. on Internet Technologies
and Systems, 1997, pages 83-94.
-
P. Cao, J. Zhang, and K. Beach, "Active cache: Caching dynamic contents
on the Web", Proc. IFIP Int. Conf. on Distributed System Platforms and
Open Distributed Processing, 1998, pages 373-388.
-
B. Smith, A. Acharya, T. Yang, H. Zhu, "Caching equivalent and partial
results for dynamic Web content", Proc. USENIX Symp. on Internet Technologies
and Systems, 1999.
-
K. Yagoub, D. Florescu, V. Issarny, and P. Valduriez, "Caching strategies
for data-intensive Web sites", Proc. VLDB Conference, 2000.
Pervasive and Mobile Computing
Student:
This topic is also quite wide, so you'll need to focus. Some literature
makes a distinction between mobile computing and pervasive computing, so
you'll need to be careful. Also, note that I am interested in data management
problems in this domain, not general systems issues.
-
A. Helal et al., Any Time, Anywhere Computing , Kluwer, 1999.
-
E. Pitoura and G. Samaras, Data Management for Mobile Computing,
Kluwer, 1998.
-
Proceedings of the RIDE Workshop, 2000. (I have this one so come
see me).
-
C. Bobineau, L. Bouganim, P. Pucheral, P. Valduriez, "PicoDBMS: Scaling
down database techniques for the Smartcard", Proc. VLDB Conference,
2000, pages 11-20.
-
IBM Systems Journal Special
Issue on Pervasive Computing, 38(4), 1999.
-
IBM's Pervasive Computing project which can be accessed here.
-
T.C. Agoston, T. Ueda, and Y. Nishimura, "
Pervasive computing in a networked world", available on-line.
-
Rome project
at Stanford.
-
CoolTown project at HP Labs.
-
Aroma project
at NIST.
-
Also look at the journal Mobile Networks and Applications .
Distributed View Management
Student: Lubomir Stanchev
The title of the topic is perhaps not exactly very informative. There are
two ways to look at this. One is the materialized view maintenance issue.
The other is the maintenance of distributed virtual views. There is more
material on the first topic, which is more popular these days. So, the
followign are the possible literature on that topic:
-
A. Gupta and I. S. Mumick (eds.), Materialized Views: Techniques, Implementations,
and Applications, MIT Press, 1999.
-
A. Gupta, and I. S. Mumick, "Maintenance
of Materialized Views: Problems, Techniques, and Applications", IEEE
Data Engineering Bulletin, 18(2): 3-18, June 1995.
-
Y. Zhuge, H. Garcia-Molina, J. Hammer, and J. Widom. "View
Maintenance in a Warehousing Environment", Proc. ACM SIGMOD Conference,
1995, pages 316-327.
-
D. Quass, A. Gupta, I.S. Mumick, and J. Widom. "Making
Views Self-Maintainable for Data Warehousing", Proc. Fourth Intl.
Conf. on Parallel and Distributed Information Systems (PDIS '96), 1996.
-
Y. Kotidis and N. Roussopoulos, "DynaMat:
A Dynamic View Management System for Data Warehouses", Proc. ACM/SIGMOD
Conference, 1999, pages 371-382.
Information Integration
Student: Meng He
There are a number of ways of approaching this topic. One can look at information
integration using views (references 1-4), or techniques to deal with schematic
and data discrepencies (references (5-9) or looking at integration/componentization
architectures using distributed platforms such as CORBA or DCOM/OLE, or
the use of XML for data integration. You should choose one of these dimensions
to work on.
-
J. D. Ullman. "Information
Integration Using Logical Views", Proc. ICDT Conference, 1997, pages
19-40.
-
A. Y. Levy. "
Answering
Queries Using Views: A Survey", unpublished (yet) manuscript.
-
A. Y. Levy, A. Rajaraman, J. J. Ordille. "
Query-Answering
Algorithms for Information Agents", Proc. AAAI Conference, 1996.
-
R. Pottinger and A. Levy. "
A
Scalable Algorithm for Answering Queries Using Views, Proc. VLDB Conference,
2000.
-
R. Krishnamurthy, W. Litwin, and W. Kent. "Language
Features for Interoperability of Databases with Schematic Discrepancies".
Proc. SIGMOD Conference, 1991,pages 40-49.
-
L. V. S. Lakshmanan, F. Sadri, and I. N. Subramanian."
SchemaSQL
- A Language for Interoperability in Relational Multi-Database Systems",
Proc VLDB Conference, 1996, pages 239-250.
-
R. J. Miller. " Using
Schematically Heterogeneous Structures", Proc. SIGMOD Conference,
1998, pages 189-200.
-
L-L Yan and M. T. Özsu, "Conflict
Tolerant Queries in AURORA", Proc. CoopIS Conference, 1999, pages 279-290
-
Va. Josifovski and T. Risch. "Integrating
Heterogeneous Overlapping Databases Through Object-Oriented Transformations",
Proc.
VLDB Conference, 1999, pages 435-446.
Hybrid Query Execution Models
Student: Ivan Bowman
The issue here can be looked at either within the context of client-server
object DBMSs, or client-server relational DBMSs. In the former, the usual
execution model is data shipping from the server to the client, which is
then responsible for execution of queries on the cached data. In the latter,
the usual execution model is function shipping where the client ships the
SQL query to the server without processing and receives the result. Hybrid
query execution considers the possibility of executing queries at both
the client and the server. The project should focus either on relational
or object DBMSs (at least originally).
-
D. Kossmann and M. J. Franklin."A
Study of Query Execution Strategies for Client-Server Database Systems",
University of Maryland Technical Report CS-TR-3512, August 1995.
-
M. J. Franklin, B.T. Jónsson, and D. Kossmann."Performance
Tradeoffs for Client-Server Query Processing", Proc. SIGMOD Conf.,
1996, pages 149-160.
-
CMU ABACUS project:
Dynamic Function Placement for Data-Intensive Cluster Computing
-
D. Petrou, K. Amiri, G. Ganger and G. Gibson. "Easing the management of
data-parallel systems via adaptation", Proc. European SIGOPS Workshop,
Kolding, Denmark, September 2000.
-
K. Amiri, D. Petrou, G. Ganger and G. Gibson. "Dynamic function placement
for data-intensive cluster computing", Proc. USENIX Annual Technical
Conference, San Diego, CA, June 2000.
-
Microsoft - Millenium
Project /Coign Sub-Project
-
Carleton
University Chameleon Project
-
L. Olstad, C. Brady, B. McHollan, J. Ramirez. "Jini
Technology: Impromptu Networking and its Impact on Telecommunications",
Proceedings of Capstone 1999, University of Colorado at Boulder,
Fall, 1999.
-
K. Voruganti, M. T. Özsu, R. C. Unrau. "An
Adaptive Hybrid Server Architecture for Client Caching ODBMSs", Proc.
VLDB Conference, 1999, pages 150-161.
-
K.Voruganti. An Adaptive CS Architecture for ODBMSs. PhD Thesis, Spring
2001, University of Alberta.