Study Topics and Reading Material
The following is a list of some papers on the two topics that we will cover in the course.
The listed papers include those that I find interesting. If you find other papers that you would like to read and study, please contact me. I have focused mostly (but not
exclusively) on conference publications. This is not because they are more
important, but only because they are shorter and may be easier to handle. Please note that the publications listed under Overview subsections of each section are not to be presented by anyone, but they are to be read by everyone.
Most of the papers listed below can be accessed (and searched) on-line from UW machines -- UW maintains a campus-wide subscription to the ACM Digital Library and IEEE Digital Library, so you should be able to search it and retrieve from it if you are coming from any machine on the UW campus network. Papers published in ACM journals and proceedings can be accessed through the ACM Digital Library while those published in IEEE sources can be obtained from IEEE Xplore. Springer publications (e.g., Lecture Notes in Computer Science - LNCS) can be obtained from Springer LINK. You may also be interested in exploring Michael Ley's databases and logic programming bibliography server (it actually has much wider coverage than its name suggests), which is searchable and contains many links to on-line papers. If the paper can only be obtained from another source, I try to provide a link to the original source (usually from the paper's title).
Data Integration*
Overview
- A. D. Jhingran, N. Mattos, and H. Pirahesh. Information integration: A research agenda. IBM Systems Journal, 41(4), 2002.
- M. A. Roth, D. C. Wolfson, J. C. Kleewein, and C. J. Nelin. Information integration: A new generation of information technology. IBM Systems Journal, 41(4), 2002.
- A. Doan and A. Halevy. Semantic Integration Research in the Database Community: A Brief Survey, AI Magazine, Special Issue on Semantic Integration, Spring 2005.
- A. Y. Halevy, N. Ashish, D. Bitton, M. Carey, D. Draper, J. Pollock, A. Rosenthal, V. Sikka. Enterprise Information Integration: Successes, Challenges, and Controversies, In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2005.
- Special issue on Information Integration on the Web, IEEE Intelligent Systems, Sept/Oct. 2003.
- Special Issue on Information Integration, IEEE Q. Bulletin on Data Engineering, September 2002.
Schema matching
- E. Rahm, and P. A. Bernstein. A survey of approaches to automatic schema matching,The VLDB Journal, 10(3): 334-350, 2001.
- S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity Flooding: A Versatile Graph Matching Algorithm, In Proc. Int. Conf. on Data Engineering, 2002.
- A. Doan, P. Domingos, and A. Halevy. Learning to Match the Schemas of Databases: A Multistrategy Approach, Machine Learning, 50(3): 279 - 301, 2003.
- R. McCann, B. AlShelbi, Q. Le, H. Nguyen, L. Vu, and A. Doan. Maveric: Mapping Maintenance for Data Integration Systems, In Proc. Int. Conf. on Very Large Data Bases, 2005.
- J. Madhavan, P. A. Bernstein, and E. Rahm. Generic Schema Matching with CUPID, In Proc. Int. Conf. on Very Large Data Bases, 2001.
- R. J. Miller, L. M. Haas, and M. A. Hernandez. Schema Mapping as Query Discovery, In Proc. Int. Conf. on Very Large Data Bases, 2000 .
- R. Krishnamurthy, W. Litwin, and W. Kent. Language Features for Interoperability of Databases with Schematic Discrepancies, In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 40-49, 1991.
- K. C.-C. Chang, B. He, and Z. Zhang. Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web, In Proc. 2nd Conf. on Innovative Data Systems Research (CIDR 2005), January 2005.
- B. He, K. C.-C. Chang, and J. Han. Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach, In Proc. ACM SIGKDD Conference (KDD 2004), pages 148-157, 2004.
- Z. Zhang, B. He, and K. C.-C. Chang. Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax, In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2004.
- W. Wu, C. Yu, A. Doan, and W. Meng. An Interactive Clustering-based Approach to Integrating Source Query interfaces on the Deep Web, In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2004.
Schema mapping
- L. Popa, Y. Velegrakis, R. J. Miller, M. A. Hernández, and R. Fagin.Translating Web Data. In Proc. Int. Conf. on Very Large Data Bases, pages 598-609, 2002.
- A. Kementsietsidis, M. Arenas, R. J. Miller: Mapping Data in Peer-to-Peer Systems. Semantics and Algorithmic Issues, In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 325-336, 2003.
- Y. Velegrakis, R. J. Miller, L. Popa. Preserving mapping consistency under schema changes, The VLDB Journal, 13(3): 274-293, 2004.
Data cleaning
- H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C-A. Saita. Declarative Data Cleaning: Language, Models, and Algorithms, In Proc. Int. Conf. on Very Large Data Bases, 2001.
- V. Raman and J. M. Hellerstein. Potter's Wheel: An Interactive Data Cleaning System, In Proc. Int. Conf. on Very Large Data Bases, 2001.
- M. Hernandez and S. Stolfo. Real-World Data is Dirty: Data Cleansing and The Merge/Purge Problem. Journal of Data Mining and Knowledge Discovery, 1997.
- S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and Efficient Fuzzy Match for Online Data Cleaning. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003.
Formalisms for Data Integration
- J. D. Ullman. Information Integration Using Logical Views, In Proc. Int. Conf. on Database Theory, pages 19-40, 1997.
- M. Lenzerini. Data Integration - A Theoretical Perspective, In Proc. ACM Symp. on Principles of Database Systems, 2002.
- A. Y. Levy. Answering Queries Using Views: A Survey, The VLDB Journal, 2001.
- A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying Heterogeneous Information Sources Using Source Descriptions, In Proc. Int. Conf. on Very Large Data Bases, 1996.
- R. Pottinger and A. Levy. A Scalable Algorithm for Answering Queries Using Views, The VLDB Journal, 2001.
- O. Duschka, M. Genesereth, and A. Levy. Recursive Query Plans for Data Integration. Journal of Logic Programming, 2000.
- L. V. S. Lakshmanan, F. Sadri, and I. N. Subramanian. SchemaSQL: An extension to SQL for multidatabase interoperability, ACM Trans. Database Syst., 26(4), December 2001.
- J. Gryz. Query Rewriting Using Views in the Presence of Functional and Inclusion Dependencies, Information Systems, 24(7): 597-612, 1999.
- R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data Exchange: Semantics and Query Answering, Theor. Comput. Sci. 336(1): 89-124, 2005.
- Y.Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. Ullman. A Query Translation Scheme for Rapid Implementation of Wrappers, In Proc. Int. Conf. on Deductive and Object-Oriented Databases (DOOD), 1995.
- Y. Papakonstantinou, A. Gupta, and L. Haas. Capabilities-Based Query Rewriting in Mediator Systems. Distributed And Parallel Databases, 6(1), 1998.
- A. Levy, A. Rajaraman, and J. Ullman. Answering Queries Using Limited External Query Processors. Journal of Computer and System Sciences, 58(1): 69-82, 1999.
Query Evaluation for Distributed Data Sources
- L. M. Haas, D. Kossmann, E. L. Wimmers, and J. Yang.
Optimizing Queries Across Diverse Data Sources, In Proc. Int. Conf. on Very Large Data Bases, pages 276-285, 1997.
- Zachary G. Ives, Daniela Florescu, Marc Friedman, Alon Levy, Daniel S. Weld. An Adaptive Query Execution System for Data Integration, In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1999.
- Zachary G. Ives, Alon Y. Halevy, Daniel S. Weld.
Adapting to Source Properties in Processing Data Integration Queries, In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 395-406, 2004.
Combining rank information
- R. Fagin. Combining Fuzzy Information: an Overview,ACM SIGMOD Record, 31(2), 2002.
- R. Fagin, A. Lotem, M. Naor. Optimal aggregation algorithms for middleware, In
Proc. 20th ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems, 2001 .
- W.F. Cody, L.M. Haas, W. Niblack, M. Arya, M.J. Carey, R. Fagin, D. Lee, D. Petkovic, P.M. Schwarz, J. Thomas, M. Tork Roth, J.H. Williams and E.L. Wimmers. Querying Multimedia Data from Multiple Repositories by Content: The Garlic Project, In Proc. IFIP 2.6 Third Working Conference on Visual Database Systems (VDB-3), 1995.
- R. Fagin. Combining Fuzzy Information From Multiple Systems, Journal of Computer and System Sciences, 58(1): 83-99, 1999.
- E.L. Wimmers, L.M. Haas, M. Tork Roth, Ch. Braendli. Using Fagin's Algorithm for Merging Ranked Results in Multimedia Middleware, In Proc. 4th IFCIS Int. Conf. on Cooperative Information Systems, pages 267-278, 1999.
- R. Fagin and E.L. Wimmers. A Formula for Incorporating Weights into Scoring Rules, In Proc. 6th Int. Conf. Database Theory, pages 247-261, 1997.
- R. Fagin. Fuzzy Queries in Multimedia Database Systems. In
Proc. 17th ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems, 1998.
Structured vs Unstructured Data
- A. Y. Halevy, O. Etzioni, A. Doan, Z. G. Ives, J. Madhavan, L. McDowell, I. Tatarinov.
Crossing the Structure Chasm, In Proc. Conf. on Innovative Data Systems Research (CIDR), 2003.
Stream Data Management
Surveys and Overview
- B. Babcock, S. Babu, M. Datar, R. Motwani, and
J. Widom. Models and issues in data streams. In
Proc. 21st ACM SIGACT-SIGMOD-SIGART Symp.
Principles of Database Systems, pages 1-16, 2002.
- L. Golab and M. T. Ozsu. Issues in data stream management.
SIGMOD Record, 32(2):5-14, 2003.
- S. Muthukrishnan, Data streams: Algorithms and applications, Unpublished Manuscript.
- N. Koudas and D. Srivastava. Data stream query processing. Tutorial presented at IEEE International Conference on Data Engineering (ICDE), page 1145, 2005.
Systems
- D. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J-H. Hwang, W. Lindner, A. Rasin, N. Tatbul, Y. Xing, and S. Zdonik. The design of the Borealis stream processing engine. In Proc. 1st Biennial Conf. on Innovative Data Syst. Res., 2005.
- S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. Shah. TelegraphCQ: Continuous data ow processing for an uncertain world. In Proc. 1st Biennial Conf. on Innovative Data Syst. Res, pages 269-280, 2003.
- M. Hammad, M. Mokbel, M. Ali, W. Aref, A. Catlin, A. Elmagarmid, M. Eltabakh, M. Elfeky, T. Ghanem, R. Gwadera, I. Ilyas, M. Marzouk, and X. Xiong. Nile: a query processing engine for data streams. In Proc. 20th Int. Conf. on Data Engineering, page 851, 2004.
- J. Kramer and B. Seeger. PIPES - a public infrastructure for processing and exploring streams. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 925-926, 2004.
- R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and R. Varma. Query processing, approximation, and resource management in a data stream management system. In Proc. 1st Biennial Conf. on Innovative Data Syst. Res, pages 245-256, 2003.
- E. Rundensteiner, L. Ding, T. Sutherland, Y. Zhu, B. Pielech, and N. Mehta. CAPE: continuous query engine with heterogeneous-grained adaptivity. In Proc. 30th Int. Conf. on Very Large Data Bases, pages 1353- 1356, 2004.
- S. Babu and J. Widom. StreaMon: an adaptive engine for stream query processing. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 931-932, 2004.
- C. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk. Gigascope: high performance network monitoring with an SQL interface. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 647-651, 2003.
Query Algebras and Languages
- D. Abadi, D. Carney, U. Cetintemel, M. Cherniack,
C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and
S. Zdonik. Aurora: A new model and architecture
for data stream management. The VLDB Journal,
12(2):120-139, Aug 2003.
- A. Arasu, S. Babu, and J. Widom. The CQL continuous
query language: Semantic foundations and query
execution. Technical Report 2003-67, Stanford University,
2003.
- J. Kramer and B. Seeger. A temporal foundation for
continuous queries over data streams. In Proc. 11th
Int. Conf. on Management of Data (COMAD), 2005.
- Y-N. Law, H. Wang, and C. Zaniolo. Query languages
and data models for database sequences and
data streams. In Proc. 30th Int. Conf. on Very Large
Data Bases, pages 492-503, 2004.
- P. Tucker, D. Maier, T. Sheard, and L. Faragas.
Exploiting punctuation semantics in continuous data
streams. IEEE Trans. Knowledge and Data Eng.,
15(3):555-568, 2003.
Sliding Window Query Processing
- L. Golab and M. T. Ozsu. Update-pattern-aware modeling and processing of continuous queries. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 658-669, 2005.
- A. Ayad and J. Naughton. Static optimization of conjunctive
queries with sliding windows over unbounded
streaming information sources. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 419-
430, 2004.
- M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining
stream statistics over sliding windows. In Proc.
13th SIAM-ACM Symp. on Discrete Algorithms, pages
635-644, 2002.
- L. Golab and M. T. Ozsu. Processing sliding window
multi-joins in continuous queries over data streams. In
Proc. 29th Int. Conf. on Very Large Data Bases, pages
500-511, 2003.
- L. Golab, D. DeHaan, E. Demaine, A. Lopez-Ortiz, and I. Munro. Identifying frequent items in sliding windows over on-line packet streams. In Proc. Internet Measurement Conference, pages 173-178, 2003.
- J. Kang, J. Naughton, and S. Viglas. Evaluating window
joins over unbounded streams. In Proc. 19th Int.
Conf. on Data Engineering, pages 341-352, 2003.
- U. Srivastava and J. Widom. Memory-limited execution
of windowed stream joins. In Proc. 30th Int. Conf.
on Very Large Data Bases, pages 324-335, 2004.
- J. Li, D. Maier, K. Tufte, V. Papadimos, and P. Tucker. Semantics and evaluation techniques for window aggregates in data streams. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 311-322, 2005.
- A. Arasu and G. S. Manku. Approximate counts and quantiles over sliding windows. In Proc. ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems, pages 286-296, 2004.
Adaptive Query Processing
- R. Avnur and J. Hellerstein. Eddies: Continuously
adaptive query processing. In Proc. ACM SIGMOD
Int. Conf. on Management of Data, pages 261-272,
2000.
- S. Madden, M. Shah, J. Hellerstein, V. Raman, Continuously Adaptive Continuous Queries over Streams, In Proc. ACM Int. Conf. on Management of Data, 2002.
- S. Babu, R. Motwani, K. Munagala, I. Nishizawa, and
J. Widom. Adaptive ordering of pipelined stream Filters.
In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 407-418, 2004.
- S. Babu, K. Munagala, J. Widom, and R. Motwani.
Adaptive caching for continuous queries. In Proc. 21st
Int. Conf. on Data Engineering, 2005.
- A. Deshpande and J. Hellerstein. Lifting the burden of
history from adaptive query processing. In Proc. 30th
Int. Conf. on Very Large Data Bases, pages 948-959,
2004.
- A. Gounaris, N. Paton, A. Fernandes, and R. Sakellariou.
Adaptive query processing: A survey. In Proc. 19th
British Nat. Conf. on Databases, pages 11-25, 2002.
- M. F. Mokbel, M. Lu, and W. Aref. Hash-merge join:
A non-blocking join algorithm for producing fast and
early join results. In Proc. 20th Int. Conf. on Data
Engineering, pages 251-262, 2004.
- F. Reiss and J. Hellerstein. Data triage: an adaptive
architecture for load shedding in telegraphCQ. In
Proc. 21st Int. Conf. on Data Engineering, 2005.
- S. Viglas, J. Naughton, and J. Burger. Maximizing
the output rate of multi-join queries over streaming
information sources. In Proc. 29th Int. Conf. on Very
Large Data Bases, pages 285-296, 2003.
- J. Chen, D. DeWitt, J. Naughton, Design and Evaluation of Alternative Selection Placement Strategies in Optimizing Continuous Queries, In Proc. International Conference on Data Engineering, 2002.
- Y. Zhu, E. Rundensteiner, and G. Heineman. Dynamic plan migration for continuous queries over data streams. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 431-442, 2004.
Operator Scheduling
- B. Babcock, S. Babu, M. Datar, and R. Motwani.
Chain: Operator scheduling for memory minimization
in data stream systems. In Proc. ACM SIGMOD Int.
Conf. on Management of Data, pages 253-264, 2003.
- D. Carney, U. Cetintemel, A. Rasin, S. Zdonik,
M. Cherniack, and M. Stonebraker. Operator scheduling
in a data stream manager. In Proc. 29th Int. Conf.
on Very Large Data Bases, pages 838-849, 2003.
Load Shedding
- B. Babcock, M. Datar, and R. Motwani. Load shedding
for aggregation queries over data streams. In
Proc. 20th Int. Conf. on Data Engineering, pages 350-
361, 2004.
- N. Tatbul, U. Cetintemel, S. Zdonik, M. Cherniack,
and M. Stonebraker. Load shedding in a data stream
manager. In Proc. 29th Int. Conf. on Very Large Data
Bases, pages 309-320, 2003.
- A. Das, J. Gehrke, and M. Riedewald,
Approximate join processing over data streams,
In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003.
- F. Reiss and J. Hellerstein. Data triage: an adaptive architecture for load shedding in TelegraphCQ. In Proc. 21st Int. Conf. on Data Engineering, pages 155-156, 2005.
Archiving Data Streams on Disk
- S. Chandrasekaran and M. J. Franklin. Remembrance
of streams past: overload-sensitive management of
archived streams. In Proc. 30th Int. Conf. on Very
Large Data Bases, pages 348-359, 2004.
- N. Shivakumar and H. Garcia-Molina. Wave-indices: indexing evolving databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 381-392, 1997.
Shared Processing of Continuous Queries
- J. Chen, D. DeWitt, F. Tian, and Y. Wang. NiagaraCQ:
A scalable continuous query system for internet
databases. In Proc. ACM SIGMOD Int. Conf. on
Management of Data, pages 379-390, 2000.
- S. Krishnamurthy, M. J. Franklin, J. M. Hellerstein,
and G. Jacobson. The case for precision sharing. In
Proc. 30th Int. Conf. on Very Large Data Bases, pages
972-986, 2004.
- R. Zhang, N. Koudas, B. Chin Ooi, and D. Srivastava. Multiple aggregations over data streams. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 299-310, 2005.
- A. Arasu and J. Widom. Resource sharing in continuous sliding-window aggregates. In Proc. 30th Int. Conf. on Very Large Data Bases, pages 336-347, 2004.
- M. Hammad, M. J. Franklin, W. Aref, and A. Elmagarmid. Scheduling for shared window joins over data streams. In Proc. 29th Int. Conf. on Very Large Data Bases, pages 297-308, 2003.
- S. Chandrasekaran and M. Franklin. Streaming queries over streaming data. In Proc. 28th Int. Conf. on Very Large Data Bases, pages 203-214, 2002.
Distributed Streams
- U. Srivastava and J. Widom. Flexible time management
in data stream systems. In Proc.
ACM SIGMOD-SIGACT-SIGART Symp. Principles
of Database Systems, pages 263-274, 2004.
- J-H. Hwang, M. Balazinska, A. Rasin, U. Cetintemel, M. Stonebraker, and S. Zdonik. High-availability algorithms for distributed stream processing. In Proc. 21st Int. Conf. on Data Engineering, pages 779-790, 2005.
- Y. Xing, S. Zdonik, and J-H. Hwang. Dynamic load distribution in the Borealis stream processor. In Proc. 21st Int. Conf. on Data Engineering, pages 791-802, 2005.
- M. Balazinska, H. Balakrishnan, S. Madden, and M. Stonebraker. Fault tolerance in the Borealis distributed stream processing system. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 13-24, 2005.
Data Stream Mining
- N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proc. ACM Symp. on Theory of Computing, pages 20-29, 1996.
- D. Kifer, S. Ben-David, J. Gehrke. Detecting change in data streams. In Proc. 30th Int. Conf. on Very Large Data Bases, pages 180-191, 2004.
- N. Koudas, B. Chin Ooi, K-L. Tan, and R. Zhang. Approximate NN queries on streams with guaranteed error/performance bounds.In Proc. 30th Int. Conf. on Very Large Data Bases, pages 804-815, 2004.
- T. Palpanas, M. Vlachos, E. Keogh, D. Gunopulos, and W. Truppel. Online amnesic approximation of streaming time series. In Proc. 20th Int. Conf. on Data Engineering, pages 338-349, 2004.
- G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. In Proc. Latin American Theoretical Informatics (LATIN), pages 29-38, 2004.
- C. Estan and G. Varghese. New directions in traffic measurement and accounting. In Proc. SIGCOMM Conference, pages 323-336, 2002.
- G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. 28th Int. Conf. on Very Large Data Bases, pages 346-357, 2002
- Y. Zhu and D. Shasha. StatStream: statistical monitoring of thousands of data streams in real time. In Proc. 28th Int. Conf. on Very Large Data Bases, pages 358-369, 2002
- Y. Zhu and D. Shasha. Efficient elastic burst detection in data streams. In Proc. Knowledege Disc. and Data Mining (KDD), pages 336-345, 2003.
* Parts of this reading list are borrowed from those generated for their courses or suggested by the following colleagues : Jan Chomicki, Renee J. Miller, Zack Ives, Alon Halevy, David Embley, ...