Please note: I have prepared a list of students taking the course. As people pick topics, we will fill these. To find a partner,
The course will be divided into three parts. In the first part, I will lecture on the fundamentals of distributed data management. This will be only for two weeks or so. The second part of the course will be paper reviews and discussions. The third part of the course will be devoted to presentation of research projects.
Each of you is responsible for picking a paper from the readings that you wish to present. The presentation should go beyond the paper, giving the background and showing how it relates to other work. Note that the last thing I am looking for is a linear presentation of the sections in the paper. These presentations should be 20-30 minutes. You are responsible for preparing the presentation slides and making them available to me by noon on the Monday of the week you will be presenting. These slides will be put online for everyone.
This presentation will be followed by a discussion of the paper (for about 30 minutes). Everyone is expected to actively participate in the debate (note that part of the final mark is devoted to class participation). Consequently, everyone should come to class prepared with questions, counterexamples, and even suggested improvements. With any luck, this will set up a debate-like atmosphere in which we can argue about the pros and cons of the basic technologies.
You might find the short brochure "Efficient Reading of Papers in Science and Technology" by Michael J. Hanson and updated by D. McNamee very useful.
Everyone will write one paper critique per week. You can choose which paper you write a critique on (of course, presenters are expected to write critiques of the papers they present). I will set up an on-line system for entering these critiques. Primarily, you should think of these as paper reviews for a conference: try to identify the strengths and weaknesses of the paper and how it can be improved. The reviews/critiques should be about 2 pages in length.
For paper critiques, the following (relatively old) paper should be quite useful: A.J. Smith, The Task of the Referee, IEEE Computer, April 1990.
Course organization
"Classical" distributed database issues
- Chapters 1, 4, 5 (5.1 and 5.2), 7, 8, 9, 10-12 from the principal reference.
- Course slides (PDF)
- Slides in 3-up handout format (PDF)
Review of topics covered in past term
- Amr Al-Helw, Web Search/Querying
- Rolanda Blanco, P2P Data Management
Paper W1: K. C.-C. Chang, B. He, and Z. Zhang. Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web, In Proc. 2nd Conf. on Innovative Data Systems Research (CIDR 2005), January 2005.
- Presenter: Hossein S. Attar (Presentation Slides)
Paper W2: S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity Flooding: A Versatile Graph Matching Algorithm, In Proc. Int. Conf. on Data Engineering, 2002.
- Presenter: George Beskales (Presentation Slides)
Paper W3: W. Wu, C. Yu, A. Doan, and W. Meng. An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web, In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2004.
- Presenter: Yingying Tao (Presentation Slides)
Paper S1: R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and R. Varma. Query processing, approximation, and resource management in a data stream management system. In Proc. 1st Biennial Conf. on Innovative Data Syst. Res., pages 245-256, 2003.
- Presenter: Abram Hindle (Presentation Slides)
Paper S2: D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: A new model and architecture for data stream management. The VLDB Journal, 12(2):120-139, Aug 2003.
- Presenter: Joel So (Presentation Slides)
Paper S3: J. Li, D. Maier, K. Tufte, V. Papadimos, and P. Tucker. Semantics and evaluation techniques for window aggregates in data streams. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 311-322, 2005.
- Presenter: Kevin Quan (Presentation Slides)
Paper W4: A. Kementsietsidis, M. Arenas, and R. J. Miller. Mapping Data in Peer-to-Peer Systems: Semantics and Algorithmic Issues, In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 325-336, 2003.
- Presenter: Bojan Jovanovic (Presentation Slides)
Paper W5: B. He, K. C.-C. Chang, and J. Han. Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach, In Proc. ACM SIGKDD Conference (KDD 2004), pages 148-157, 2004.
- Presenter: Maryam Karimzadehgan (Presentation Slides)
Paper W6: A. Doan, P. Domingos, and A. Halevy. Learning to Match the Schemas of Databases: A Multistrategy Approach, Machine Learning, 50(3): 279-301, 2003.
- Presenter: Shimin Guo (Presentation Slides)
Paper S4: L. Golab and M. T. Ozsu. Processing sliding window multi-joins in continuous queries over data streams. In Proc. 29th Int. Conf. on Very Large Data Bases, pages 500-511, 2003.
- Presenter: Yingying Tao (Presentation Slides)
Paper S5: A. Arasu and J. Widom. Resource sharing in continuous sliding-window aggregates. In Proc. 30th Int. Conf. on Very Large Data Bases, pages 336-347, 2004.
- Presenter: Hossein S. Attar (Presentation Slides)
Paper S6: B. Babcock, S. Babu, M. Datar, and R. Motwani. Chain: Operator scheduling for memory minimization in data stream systems. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 253-264, 2003.
- Presenter: Kareem El-Gebaly (Presentation Slides)
Paper W7: R. McCann, B. K. AlShebli, Q. Le, H. Nguyen, L. Vu, and A. Doan. Mapping Maintenance for Data Integration Systems. In Proc. 31st Int. Conf. on Very Large Data Bases, 2005.
- Presenter: Laurent Charlin (Presentation Slides)
Paper W8: E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching, The VLDB Journal, 10(3): 334-350, 2001.
- Presenter: Joel So (Presentation Slides)
Paper W9: Zachary G. Ives, Daniela Florescu, Marc Friedman, Alon Levy, Daniel S. Weld. An Adaptive Query Execution System for Data Integration, In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1999.
- Presenter: Anand Subramanian (Presentation Slides)
Paper S7: R. Avnur and J. Hellerstein. Eddies: Continuously adaptive query processing. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 261-272, 2000.
- Presenter: Bojan Jovanovic (Presentation Slides)
Paper S8: A. Ayad and J. Naughton. Static optimization of conjunctive queries with sliding windows over unbounded streaming information sources. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 419-430, 2004.
- Presenter: Maryam Karimzadehgan (Presentation Slides)
Paper S9: S. Chandrasekaran and M. Franklin. Streaming queries over streaming data. In Proc. 28th Int. Conf. on Very Large Data Bases, pages 203-214, 2002.
- Presenter: George Beskales (Presentation Slides)
Paper W10: A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying Heterogeneous Information Sources Using Source Descriptions, In Proc. Int. Conf. on Very Large Data Bases, 1996.
- Presenter: Mirza Beg (Presentation Slides)
Paper W11: M. Lenzerini. Data Integration - A Theoretical Perspective, In Proc. ACM Symp. on Principles of Database Systems, 2002.
- Presenter: Kareem El Gebaly (Presentation Slides)
Paper W12: S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and Efficient Fuzzy Match for Online Data Cleaning. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003.
- Presenter: Aaditeshwar Seth (Presentation Slides)
Paper S10: S. Babu, R. Motwani, K. Munagala, I. Nishizawa, and J. Widom. Adaptive ordering of pipelined stream filters. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 407-418, 2004.
- Presenter: Shimin Guo (Presentation Slides)
Paper S11: J-H. Hwang, M. Balazinska, A. Rasin, U. Cetintemel, M. Stonebraker, and S. Zdonik. High-availability algorithms for distributed stream processing. In Proc. 21st Int. Conf. on Data Engineering, pages 779-790, 2005.
- Presenter: Anand Subramanian (Presentation Slides)
Paper S12: N. Shivakumar and H. Garcia-Molina. Wave-indices: indexing evolving databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 381-392, 1997.
- Presenter: Mirza Beg (Presentation Slides)
Paper W13: L. M. Haas, D. Kossmann, E. L. Wimmers, and J. Yang. Optimizing Queries Across Diverse Data Sources, In Proc. Int. Conf. on Very Large Data Bases, pages 276-285, 1997.
- Presenter: Kevin Quan (Presentation Slides)
Paper W14: R. Fagin, A. Lotem, M. Naor. Optimal aggregation algorithms for middleware, In Proc. 20th ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems, 2001.
- Presenter: Weihan Wang (Presentation Slides)
Paper W15: A. Y. Halevy, O. Etzioni, A. Doan, Z. G. Ives, J. Madhavan, L. McDowell, and I. Tatarinov. Crossing the Structure Chasm, In Proc. Conf. on Innovative Data Systems Research (CIDR), 2003.
- Presenter: Abram Hindle (Presentation Slides)
Paper S13: M. Balazinska, H. Balakrishnan, S. Madden, and M. Stonebraker. Fault tolerance in the Borealis distributed stream processing system. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 13-24, 2005.
- Presenter: Weihan Wang (Presentation Slides)
Paper S14: C. Estan and G. Varghese. New directions in traffic measurement and accounting. In Proc. SIGCOMM Conference, pages 323-336, 2002.
- Presenter: Aaditeshwar Seth (Presentation Slides)
Paper S15: D. Kifer, S. Ben-David, J. Gehrke. Detecting change in data streams. In Proc. 30th Int. Conf. on Very Large Data Bases, pages 180-191, 2004.
- Presenter: Laurent Charlin (Presentation Slides)
University of Waterloo | Computer Science | M.T. Özsu | CS 856 Home Page