CS 748T
Topics in Databases - Distributed Database Management

Guidelines for Term Projects

Instructor: M. Tamer Özsu

Project Objectives and Scope

The projects consists of picking up a distributed database research problem from among the ones listed below and working on its solution for the duration of this term. I would be willing to entertain other topics that may be of interest to you, but you should contact me immediately

What I expect in the project is a good understanding of the problem (resulting in a survey part), insight into its solution and a well defined strategy for its solution. You should treat the term project as if you were doing the initial background study for further in-depth research.In other words, the report should demonstrate an understanding of and an insight into the problem such that given enough time, you could carry it to its logical conclusion and complete the research.

The term project that you will prepare for this course will contribute towards 50% of your final grade. Half of this will be for the survey part, and the other half for the research part. These could be individual projects, or group projects of two people. For group projects, both members of the group will receive the same grade. Therefore, it is incumbent upon you to make sure that both partners share in the work (and let me know very quickly if the partnership is not working).

Deliverables

There are two deliverables of the term project:

A survey report that describes the problem domain, with proper problem definition, and a survey of existing work. This should be about 25 typed pages (12pt type with 1.5 spacing). This report, as indicated in the schedule below, will be handed in at the end of the 6th week of classes. Each person will then be responsible for presenting the field to class and leading a discussion. This will start in Week 7 and each topic will get about 1.5 hours of class time.
A research report which will describe your own attempt to either solve a problem in this domain or go a long way towards its solution. What I minimally expect is a good solution approach such that if I gave you 2-3 more months, you could complete the solution, conduct the experiments and produce a publishable paper. This report should be about 30 typed pages,including a summary survey (i.e., a summary of part 1) of about 5 pages.

The first part is something that every individual (group) should do well; the second part is going to vary with each individual's (group's) abilities. It is possible that some groups may not do well on the second part; you should, therefore, make sure that you do really well on the survey part to get a decent mark.

Note that the report is very important. It should be written carefully so that I can understand it easily. There will probably not be enough time for me to spend an inordinate amount of effort trying to understand what you mean. So include the necessary introductory material and make sure that the presentation is good. All reports should be typewritten. Use a word processor of your choice.

Schedule

The following is the schedule that we will follow.

January 3:: We decide if this is an individual or a joint project. If it is joint, your first task is to find a partner.
January 17:
January 31:
February 9:
April 11:: Absolute deadline for handing in final reports.

Research Topics

Here are a few research topics that you might wish to consider. Note that the papers listed here are only to get you started. They are not meant to be a complete list, and they many not even be the most important ones from your perspective. My hope is that by the time you complete the project, you'll have a good idea of what each area is about and what the more important publications are. You should be looking at the proceedings of conferences such as ACM SIGMOD, VLDB, ICDE, ICDCS, ... and at journals such as ACMTODS, IEEE TKDE, VLDB Journal, Distributed and Parallel Databases Journal, Journal of Intelligent Information Systems, World Wide Web, and many others.

Most of these publications can either be obtained through the University of Waterloo Library's TRELLIS System, or through ACM Digital Library, or through VLDB Endowment web page, or Michael Ley's Computer Science Bibliography. Michael's Bibliography is probably the best place to start since it incorporates many of the papers. In a few cases where I thought you might have difficulty finding the paper, I have included a link.

Data Replication

Student:

Most distributed database systems are replicated. Replication has even gained importance outside the database domain as a means of improving system access performance and system reliability. There are some classical proposals,but there are also newer ones that push the envelope significantly in how replicated data are managed. The following is a combination of new and classic publications for you to consider.

M. Buretta, Data Replication, Wiley, 1997.
S.B. Davidson, H. Garcia-Molina and D. Skeen, "Consistency in partitioned networks", ACM Computing Surveys, 17(3): 341-370, 1985.
J. Gray, P. Helland, P. O"Neil, and D. Shasha, "The dangers of replication and a solution", Proc. ACM SIGMOD Conference , 1996, pages 173-182.
B. Kemme and G. Alonso, " Anew approach to developing and implementing eager database replication protocols ", ACM Trans. Database Systems, to appear in September2000 issue.
B. Kemme and G. Alonso, "Don't be lazy, be consistent: Postgres-R, A new way to implement database replication", Proc. VLDB Conference , 2000,pages 134-143.
Y. Breitbart et al., "Update propagation protocols for replicated databases",Proc.ACM SIGMOD Conference, 1999, pages 97-108.
E. Pacitti and E. Simon, "Update propagation strategies to improve freshness in lazy master replicated databases", VLDB Journal , 8(3-4): 305-318,2000.
E. Pacitti, P. Minet and E. Simon, "Fast algorithms for maintaining replica consistency in lazy master replicated algorithms", Proc. VLDB Conference,1999, pages 126-137.

New networking architectures and their impact on DDBMS

Student: Ning Zhang

This research deals with the study of the effects of changing computer network characteristics on the architecture and algorithms of DDBMSs. The infrastructure on which DDBMSs are built is undergoing dramatic and rapid changes. An important part of this infrastructure is the computer network over which DDBMSs operate. A major change in this infrastructure is the emergence of high-bandwidth, high-speed broadband networks coupled with lightweight (low overhead) protocols. The characteristics of these networks raise questions about the fundamental DDBMS assumption that the network is the bottleneck. The question I want to investigate in this project is exactly what these changes are, and what are the database architecture components/protocols/... that need to be revisited.

A. Benner, Fibre Channel: Gigabit Communications and I/O for Computer networks, McGraw Hill, 1995.
Papers from the Trapezeproject at Duke University.
Infiniband Technology Specifications (I have these so contact me)

InfiniBand Architecture Specification, Volume 1 Release 1.0, Infiniband Trade Association, October 2000.
InfiniBand Architecture Specification, Volume 2 Release 1.0, Infiniband Trade Association, October 2000.

Web Data Management

Student:

This is a loaded topic.It means many things and it means different things to different people. I have separated the topic into a number of segments. Under this heading, I am more interested in modeling and querying Web data. In addition to the Abiteboul, Bunemann, and Suciu's book Data on the Web , the following references are relevant.

P. Atzeni, G. Mecca, and P. Merialdo, "To weave the Web", Proc. VLDB Conference, 1997, pages 206-215.
D. Florescu, A. Levy, and A. Mendelzon, "Database techniques for the World Wide Web: A survey", ACM SIGMOD Record, 27(3): 9-74, 1998.
G. Arocena and A. Mendelzon, "WebOQL: Restructuring documents, databases, and Webs", Proc. ICDE, 1998.
A. Mendelzon and T. Milo, "Formal models of Web queries", Information Systems, 28(6): 615-637, 1998.
A. Mendelzon, G. Mihaila, and T. Milo, "Querying the World Wide Web", Journal of Digital Libraries, 1(1): 54-67, 1997.
P. Raghavan, "The Web as a graph", Proc. ACM PODS Conference, 2000.
S. Abiteboul, "Querying semi-structured data", Proc. 6th ICDT, 1997.
S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener, "The Lorel query language for semistructured data", Journal of Digital Libraries, 1(1): 68-88, 1997.
P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu, "A query language and optimization techniques for unstructured data", Proc. ACM SIGMOD Conference, 1996, pages 505-516.
Y. Diao, H. Lu, S. Chen, and Z. Tian, "Toward learning based Web query processing", Proc. VLDB Conference, 2000, pages 317-328.
S. Nesterov, S. Abiteboul, and R. Motwani, "Extracting schema from semistructured data", Proc. ACM SIGMOD Conference, 1998.

Web Search Engines/Crawlers

Student: Sunny Lam

This topic focuses on an interesting aspect of the Web data management that has significant overlaps with information retrieval. Therefore, I suggest that you also look at information retrieval literature. The first reference is a good one to start from.

Chapter 13 of R. Baeza-Yates and B. Riberio-Neto, Modern Information Retrieval, Addison-Wesley, 1999.
S. Brin and L. Page, " The anatomy of a large-scale hypertextual Web search engine", Proc. 7th WWW Conference, 1998.
A. Heydon and M. Najork, " Mercator: A scalable, extensible Web crawler", World Wide Web, 2(4): 219-229, 1999.
L. Gravano and H. Garcia-Molina, "Generalizing GlOSS to vector-space databases and broker hierarchies", Proc. VLDB Conference , 1995.
A. Howe and D. Dreilinger, "SavoySearch: A meta-search engine that learns which engines to query", AI Magazine, 18(2), 1997.
S. Kirsch, "The future of Internet search: Infoseek's experiences searching the Internet", ACM SIGIR Forum, 32(2): 3-7, 1998.
Special issue of Data Engineering Bulletin, 23(3), September 2000.
L. Gravano, C-C.K. Chang, H. Garcia-Molina, and A. Paepcke, "STARTS: Stanford proposal for Internet meta searching", Proc. ACM SIGMOD Conference, 1997, pages 207-218.
J. Kleinberg, "Authoritative sources in a hyperlinked environment", Proc. ACM-SIAM Symp. on Discrete Algorithms, 1998, pages 668-677.

Web Mining

Student: Yan Wang

Web mining can focus on three major areas: Web usage mining, Web content mining, and Web structure mining. The first paper below focuses on Web usage mining. The second one is more general, and the third one deals with exploiting links in the mining process. The other papers are not classified this way.

J. Srivastava, R. Cooley, M.Deshpande, P.Tan, " Web usage mining: discovery and applications of web usage pattern from Web data", SIGKDD Explorations, 1(2), January 2000.
R. Kosala and H. Blockeel, " Web Mining Research: A Survey", SIGKDD Explorations, 2(1), June 2000.
S. Chakrabarti, " Data Mining for Hypertext: A Tutorial Survey", SIGKDD Explorations , 1(2), January 2000.
R. Cooley, B. Mobasher, J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web", on-line publication.

Web Caching

Student: Leon Cao

This is a very wide topic. One can talk about static versus dynamic Web caching algorithms, or one can focus on Web cache consistency algorithms, or even proxy caching systems. The literature is so wide that it is hard to give a representative sample. However, I have the following list for now and will be adding to it soon. You can also find a ton of them if you do a Web search.

Take a look at the bibliography http://www.research.att.com/~misha/tutorial/tutorial-bib.txt
M. Franklin, M. Carey and M. Livny, "Transactional Client-Server Cache Consistency: Alternatives and Performance", ACM Transactions on Database Systems, September, 1997.
G. Bang, F. Douglis, and M. Rabinovich, "Optimistic deltas for WWW latency reduction", Proc. USENIX Technical Conference , 1997, pages 289-304.
F. Douglis, A. Haro, and M. Rabinovich, "HPP:HTML Macro-preprocessing to support dynamic document caching", Proc. USENIX Symp. on Internet Technologies and Systems, 1997, pages 83-94.
P. Cao, J. Zhang, and K. Beach, "Active cache: Caching dynamic contents on the Web", Proc. IFIP Int. Conf. on Distributed System Platforms and Open Distributed Processing, 1998, pages 373-388.
B. Smith, A. Acharya, T. Yang, H. Zhu, "Caching equivalent and partial results for dynamic Web content", Proc. USENIX Symp. on Internet Technologies and Systems, 1999.
K. Yagoub, D. Florescu, V. Issarny, and P. Valduriez, "Caching strategies for data-intensive Web sites", Proc. VLDB Conference, 2000.

Pervasive and Mobile Computing

Student:

This topic is also quite wide, so you'll need to focus. Some literature makes a distinction between mobile computing and pervasive computing, so you'll need to be careful. Also, note that I am interested in data management problems in this domain, not general systems issues.

A. Helal et al., Any Time, Anywhere Computing , Kluwer, 1999.
E. Pitoura and G. Samaras, Data Management for Mobile Computing, Kluwer, 1998.
Proceedings of the RIDE Workshop, 2000. (I have this one so come see me).
C. Bobineau, L. Bouganim, P. Pucheral, P. Valduriez, "PicoDBMS: Scaling down database techniques for the Smartcard", Proc. VLDB Conference, 2000, pages 11-20.
IBM Systems Journal Special Issue on Pervasive Computing, 38(4), 1999.
IBM's Pervasive Computing project which can be accessed here.
T.C. Agoston, T. Ueda, and Y. Nishimura, " Pervasive computing in a networked world", available on-line.
Rome project at Stanford.
CoolTown project at HP Labs.
Aroma project at NIST.
Also look at the journal Mobile Networks and Applications .

Distributed View Management

Student: Lubomir Stanchev

The title of the topic is perhaps not exactly very informative. There are two ways to look at this. One is the materialized view maintenance issue. The other is the maintenance of distributed virtual views. There is more material on the first topic, which is more popular these days. So, the followign are the possible literature on that topic:

A. Gupta and I. S. Mumick (eds.), Materialized Views: Techniques, Implementations, and Applications, MIT Press, 1999.
A. Gupta, and I. S. Mumick, "Maintenance of Materialized Views: Problems, Techniques, and Applications", IEEE Data Engineering Bulletin, 18(2): 3-18, June 1995.
Y. Zhuge, H. Garcia-Molina, J. Hammer, and J. Widom. "View Maintenance in a Warehousing Environment", Proc. ACM SIGMOD Conference, 1995, pages 316-327.
D. Quass, A. Gupta, I.S. Mumick, and J. Widom. "Making Views Self-Maintainable for Data Warehousing", Proc. Fourth Intl. Conf. on Parallel and Distributed Information Systems (PDIS '96), 1996.
Y. Kotidis and N. Roussopoulos, "DynaMat: A Dynamic View Management System for Data Warehouses", Proc. ACM/SIGMOD Conference, 1999, pages 371-382.

Information Integration

Student: Meng He

There are a number of ways of approaching this topic. One can look at information integration using views (references 1-4), or techniques to deal with schematic and data discrepencies (references (5-9) or looking at integration/componentization architectures using distributed platforms such as CORBA or DCOM/OLE, or the use of XML for data integration. You should choose one of these dimensions to work on.

J. D. Ullman. "Information Integration Using Logical Views", Proc. ICDT Conference, 1997, pages 19-40.
A. Y. Levy. " Answering Queries Using Views: A Survey", unpublished (yet) manuscript.
A. Y. Levy, A. Rajaraman, J. J. Ordille. " Query-Answering Algorithms for Information Agents", Proc. AAAI Conference, 1996.
R. Pottinger and A. Levy. " A Scalable Algorithm for Answering Queries Using Views, Proc. VLDB Conference, 2000.
R. Krishnamurthy, W. Litwin, and W. Kent. "Language Features for Interoperability of Databases with Schematic Discrepancies". Proc. SIGMOD Conference, 1991,pages 40-49.
L. V. S. Lakshmanan, F. Sadri, and I. N. Subramanian." SchemaSQL - A Language for Interoperability in Relational Multi-Database Systems", Proc VLDB Conference, 1996, pages 239-250.
R. J. Miller. " Using Schematically Heterogeneous Structures", Proc. SIGMOD Conference, 1998, pages 189-200.
L-L Yan and M. T. Özsu, "Conflict Tolerant Queries in AURORA", Proc. CoopIS Conference, 1999, pages 279-290
Va. Josifovski and T. Risch. "Integrating Heterogeneous Overlapping Databases Through Object-Oriented Transformations", Proc. VLDB Conference, 1999, pages 435-446.

Hybrid Query Execution Models

Student: Ivan Bowman

The issue here can be looked at either within the context of client-server object DBMSs, or client-server relational DBMSs. In the former, the usual execution model is data shipping from the server to the client, which is then responsible for execution of queries on the cached data. In the latter, the usual execution model is function shipping where the client ships the SQL query to the server without processing and receives the result. Hybrid query execution considers the possibility of executing queries at both the client and the server. The project should focus either on relational or object DBMSs (at least originally).

D. Kossmann and M. J. Franklin."A Study of Query Execution Strategies for Client-Server Database Systems", University of Maryland Technical Report CS-TR-3512, August 1995.
M. J. Franklin, B.T. Jónsson, and D. Kossmann."Performance Tradeoffs for Client-Server Query Processing", Proc. SIGMOD Conf., 1996, pages 149-160.
CMU ABACUS project: Dynamic Function Placement for Data-Intensive Cluster Computing

D. Petrou, K. Amiri, G. Ganger and G. Gibson. "Easing the management of data-parallel systems via adaptation", Proc. European SIGOPS Workshop, Kolding, Denmark, September 2000.
K. Amiri, D. Petrou, G. Ganger and G. Gibson. "Dynamic function placement for data-intensive cluster computing", Proc. USENIX Annual Technical Conference, San Diego, CA, June 2000.

Microsoft - Millenium Project /Coign Sub-Project

G. C. Hunt and M. L. Scott. "The Coign Automatic Distributed Partitioning System", Proc.3rd Symposium on Operating System Design and Implementation, 1999, pp 187-200.
G. C. Hunt, Automatic Distributed Partitioning of Component-Based Applications, Ph.D. Dissertation, University of Rochester, Rochester, NY, July 1998.

Carleton University Chameleon Project
L. Olstad, C. Brady, B. McHollan, J. Ramirez. "Jini Technology: Impromptu Networking and its Impact on Telecommunications", Proceedings of Capstone 1999, University of Colorado at Boulder, Fall, 1999.
K. Voruganti, M. T. Özsu, R. C. Unrau. "An Adaptive Hybrid Server Architecture for Client Caching ODBMSs", Proc. VLDB Conference, 1999, pages 150-161.
K.Voruganti. An Adaptive CS Architecture for ODBMSs. PhD Thesis, Spring 2001, University of Alberta.