Papers to Read
The following is a list of some papers on the two topics that we will cover in the course.
The listed papers include those that I find interesting. If you find other papers that you would like to read and study, please contact me. I have focused mostly (but not exclusively) on conference publications. This is not because they are more important, but only because they are shorter and may be easier to handle. Please note that the publications listed under Overview subsections of each section are not to be presented by anyone, but they are to be read by everyone.
Most of the papers listed below can be accessed (and searched) on-line from UW machines. UW maintains campus-wide subscriptions to the ACM Digital Library and the IEEE Digital Library, so you should be able to search and retrieve papers from any machine on the UW campus network.
- Papers published in ACM journals and proceedings can be accessed through the ACM Digital Library.
- Papers published in IEEE sources can be obtained from IEEE Xplore.
- Springer publications (e.g., Lecture Notes in Computer Science - LNCS) can be obtained from Springer LINK.
- You may also be interested in exploring Michael Ley's DBLP server, which is searchable and contains many links to on-line papers.
If the paper can only be obtained from another source, I try to provide a link to the original source (usually from the paper's title).
Data Partitioning/Placement
- J. Zhou, N. Bruno, and W. Lin. Advanced partitioning techniques for massively distributed computation. Proc. ACM SIGMOD International Conference on Management of Data, pages 13-24, 2012.
- A. Pavlo, C. Curino, and S. Zdonik. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. Proc. ACM SIGMOD International Conference on Management of Data, pages 61-72, 2012.
- Mohamed Y. Eltabakh, Yuanyuan Tian, Fatma Özcan, Rainer Gemulla, Aljoscha Krettek, and John McPherson.
CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop. Proc. VLDB, 4(9): 575-585, 2011.
Distributed Queries
- L. Amsaleg, M. Franklin, A. Tomasic, Dynamic Query Operator Scheduling for Wide-Area Remote Access, Distributed and Parallel Databases, 6(3): 217-246, 1998.
- A. Halevy, Answering queries using views: A survey, VLDB J., 10(4): 270-294, 2001.
- R. Avnur and J. M. Hellerstein, Eddies: Continuously adaptive query processing, Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 261-272, 2000.
- M. A. Shah, J. M. Hellerstein, S. Chandrasekaran, and M. J. Franklin, Flux: An adaptive partitioning operator for continuous query systems, Proc. 19th Int. Conf. On Data Engineering, pages 25-36, 2003.
- F. Porto, E.S. Laber, and P. Valduriez, Cherry picking: A semantic query processing strategy for the evaluation of expensive predicates, Proc. Brazilian Symposium on Databases, pages 356-370, 2003.
- F. Tian and D. J. DeWitt, Tuple routing strategies for distributed Eddies, Proc. 29th Int. Conf. On Very Large Data Bases, pages 333-344, 2003.
- J. R. Thomsen, M. L. Yiu, and C. S. Jensen. Effective caching of shortest paths for location-based services. Proc. ACM SIGMOD International Conference on Management of Data, pages 313-324, 2012.
- H. Herodotou, N. Borisov, and S. Babu. Query optimization techniques for partitioned tables. Proc. ACM SIGMOD International Conference on Management of Data, pages 49-60, 2011.
Distributed Transactions
- Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, Daniel J. Abadi: Calvin: fast distributed transactions for partitioned database systems, Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, 2012.
- Daniel Peng, Frank Dabek: Large-scale Incremental Processing Using Distributed Transactions and Notifications. OSDI, pages 251-264, 2010.
- Jun Rao, Eugene J. Shekita, Sandeep Tata: Using Paxos to Build a Scalable, Consistent, and Highly Available Datastore. Proc. VLDB, 4(4): 243-254, 2011.
- Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, Joseph M. Hellerstein, Ion Stoica: Probabilistically Bounded Staleness for Practical Partial Quorums. Proc. VLDB, 5(8): 776-787, 2012.
- A. Pavlo, E. P.C. Jones, and S. Zdonik.
On Predictive Modeling for Optimizing Transaction Execution in Parallel OLTP Systems. Proc. VLDB, 5(2): 85-96, 2012.
- H.T. Vo, S. Wang, D. Agrawal, G. Chen, B.C. Ooi.
LogBase: A Scalable Log-structured Database System in the Cloud. Proc. VLDB, 5(10): 1004-1015, 2012.
- Stacy Patterson, Aaron J. Elmore, Faisal Nawab, Divyakant Agrawal, Amr El Abbadi.
Serializability, not Serial: Concurrency Control and Availability in Multi-Datacenter Datastores. Proc. VLDB, 5(11): 1459-1470, 2012.
- Ippokratis Pandis, Pınar Tözün, Ryan Johnson, and Anastasia Ailamaki.
PLP: Page Latch-free Shared-everything OLTP. Proc. VLDB, 4(10): 610-621, 2011.
Data Replication
- Yuri Breitbart, Raghavan Komondoor, Rajeev Rastogi, S. Seshadri, Abraham Silberschatz: Update Propagation Protocols for Replicated Databases, Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 97-108, 1999.
- Carlo Curino, Yang Zhang, Evan P. C. Jones, Samuel Madden: Schism: a Workload-Driven Approach to Database Replication and Partitioning. Proc. VLDB, 3(1): 48-57, 2010.
- M. Waldvogel, P. Hurley, D. Bauer, Dynamic Replica Management in Distributed Hash Tables, IBM Technical Report RZ 3502, 2003.
- M. P. Consens, K. Ioannidou, J. LeFevre, and N. Polyzotis. Divergent physical design tuning for replicated databases. Proc. ACM SIGMOD International Conference on Management of Data, pages 49-60, 2012.
- Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, Joseph M. Hellerstein, Ion Stoica.
Probabilistically Bounded Staleness for Practical Partial Quorums. Proc. VLDB, 5(8): 776-787, 2012.
- Sudarshan Kadambi, Jianjun Chen, Brian F. Cooper, David Lomax, Raghu Ramakrishnan, Adam Silberstein, Erwin Tam, and Hector Garcia-Molina.
Where in the World is My Data?, Proc. VLDB, 4(11): 1040-1050, 2011.
Parallel Data Management
- F. Akal, K. Böhm, and H.-J. Schek, OLAP query evaluation in a database cluster: A performance study on intra-query parallelism, Proc. 6th East European Conf. Advances in Databases and Information Systems, pages 218-231, 2002.
- U. Röhm, K. Böhm, and H.-J. Schek, OLAP query routing and physical design in a database cluster, Advances in Database Technology, Proc. 7th Int. Conf. On Extending Database Technology, pages 254-268, 2000.
- A. Lima, M. Mattoso, and P. Valduriez, OLAP query processing in a database cluster, Proc. 10th Int. Euro-Par Conf., pages 355-362, 2004.
- C. Furtado, A. Lima, E. Pacitti, P. Valduriez and M. Mattoso, Physical and virtual partitioning in OLAP database clusters, Proc. Int. Symp. Computer Architecture and High Performance Computing, pages 143-150, 2005.
- C. Furtado, A. Lima, E. Pacitti, P. Valduriez and M. Mattoso, Adaptive hybrid partitioning for OLAP query processing in a database cluster, Int. J. High Perf. Comput. And Networking, 5(4): 251-262, 2008.
- H. Köhler, J. Yang, and X. Zhou. Efficient parallel skyline processing using hyperplane projections. Proc. ACM SIGMOD International Conference on Management of Data, pages 85-96, 2011.
- P. Upadhyaya, Y.C. Kwon, and M. Balazinska. A latency and fault-tolerance optimizer for online parallel query plans. Proc. ACM SIGMOD International Conference on Management of Data, pages 241-252, 2011.
- E. Soroush, M. Balazinska, and D. Wang. ArrayStore: a storage manager for complex parallel array processing. Proc. ACM SIGMOD International Conference on Management of Data, pages 253-264, 2011.
- Martina-Cezara Albutiu, Alfons Kemper, Thomas Neumann.
Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems. Proc. VLDB, 5(10): 1064-1075, 2012.
Database Integration
- R. J. Miller, L. M. Haas, and M. A. Hernandez. Schema Mapping as Query Discovery, In Proc. Int. Conf. on Very Large Data Bases, 2000.
- A. Doan, P. Domingos, and A. Halevy. Learning to Match the Schemas of Databases: A Multistrategy Approach, Machine Learning, 50(3): 279-301, 2003.
- R. McCann, B. AlShebli, Q. Le, H. Nguyen, L. Vu, and A. Doan. Maveric: Mapping Maintenance for Data Integration Systems, In Proc. Int. Conf. on Very Large Data Bases, 2005.
- H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C-A. Saita. Declarative Data Cleaning: Language, Models, and Algorithms, In Proc. Int. Conf. on Very Large Data Bases, 2001.
- V. Raman and J. Hellerstein, Potter's wheel: An interactive data cleaning system, Proc. 27th Int. Conf. On Very Large Data Bases, pages 381-390, 2001.
- S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and Efficient Fuzzy Match for Online Data Cleaning. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003.
- L. M. Haas, D. Kossmann, E. L. Wimmers, and J. Yang.
Optimizing Queries Across Diverse Data Sources, In Proc. Int. Conf. on Very Large Data Bases, pages 276-285, 1997.
- Zachary G. Ives, Daniela Florescu, Marc Friedman, Alon Levy, Daniel S. Weld. An Adaptive Query Execution System for Data Integration, In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1999.
- Zachary G. Ives, Alon Y. Halevy, Daniel S. Weld.
Adapting to Source Properties in Processing Data Integration Queries, Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 395-406, 2004.
- L. Qian, M. J. Cafarella, and H. V. Jagadish. Sample-driven schema mapping. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 73-84, 2012.
- H. Elmeleegy, A. Elmagarmid, and J. Lee. Leveraging query logs for schema mapping generation in U-MAP. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 121-132, 2011.
- B. Alexe, B. ten Cate, P. G. Kolaitis, and W-C. Tan. Designing and refining schema mappings via data examples. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 133-144, 2011.
- M. Zhang, M. Hadjieleftheriou, B. C. Ooi, C. M. Procopiuc, and D. Srivastava. Automatic discovery of attributes in relational databases. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 109-120, 2011.
- W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Interaction between record matching and data repairing. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 469-480, 2011.
- Jiannan Wang, Guoliang Li, Jeffrey Xu Yu, and Jianhua Feng.
Entity Matching: How Similar Is Similar, Proc. VLDB, 4(10): 622-633, 2011.
- Vibhor Rastogi, Nilesh N. Dalvi, Minos N. Garofalakis.
Large-Scale Collective Entity Matching. Proc. VLDB, 4(4): 208-218, 2011.
Peer-to-Peer Data Management
- A. Kementsietsidis, M. Arenas, R. J. Miller: Mapping Data in Peer-to-Peer Systems: Semantics and Algorithmic Issues, In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 325-336, 2003.
- B. Yang, H. Garcia-Molina, Comparing Hybrid Peer-to-Peer Systems, In Proc. of 27th International Conference on Very Large Data Bases, 2001.
- A. Crespo, H. Garcia-Molina, Routing Indices For Peer-to-Peer Systems, In Proc. International Conference on Distributed Computing Systems, 2002.
- S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker, A Scalable Content-Addressable Network, In Proc. ACM SIGCOMM Conf. on Applications, Technologies, Architectures, and Protocols for Computer Communication, 2001.
- B.Y. Zhao, L. Huang, J. Stribling, S.C. Rhea, A. D. Joseph, and J.D. Kubiatowicz, Tapestry: A Resilient Global-Scale Overlay for Service Deployment, IEEE J. on Selected Areas in Comm., 22(1), January 2004.
- K. Aberer, P. Cudre-Mauroux, and M. Hauswirth, The Chatty Web: Emergent Semantics Through Gossiping, In Proc. 12th Int. World Wide Web Conf., 2003.
- B. Gedik and L. Liu, PeerCQ: A Decentralized and Self-Configuring Peer-to-Peer Information Monitoring System. In Proc. 23rd Int. Conf. on Distributed Computing Systems, 2003.
- W.S. Ng, B. C. Ooi, K-L Tan, and A. Zhou, PeerDB: A P2P-based System for Distributed Data Sharing. In Proc. 19th Int. Conf. on Data Eng., 2003.
- P. Kalnis, W.S. Ng, B. C. Ooi, D. Papadias, and K-L. Tan, An Adaptive Peer-to-Peer Network for Distributed Caching of OLAP Results, In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2002.
Stream Data Management
- C. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk. Gigascope: high performance network monitoring with an SQL interface. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 647-651, 2003.
- E. Rundensteiner, L. Ding, T. Sutherland, Y. Zhu, B. Pielech, and N. Mehta. CAPE: continuous query engine with heterogeneous-grained adaptivity. In Proc. 30th Int. Conf. on Very Large Data Bases, pages 1353-1356, 2004.
- D. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J-H. Hwang, W. Lindner, A. Rasin, N. Tatbul, Y. Xing, and S. Zdonik. The design of the Borealis stream processing engine. In Proc. 2nd Biennial Conf. on Innovative Data Syst. Res., 2005.
- L. Golab and M. T. Özsu. Update-pattern-aware modeling and processing of continuous queries. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 658-669, 2005.
- A. Ayad and J. Naughton. Static optimization of conjunctive queries with sliding windows over unbounded streaming information sources. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 419-430, 2004.
- M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. In Proc. 13th ACM-SIAM Symp. on Discrete Algorithms, pages 635-644, 2002.
- L. Golab and M. T. Özsu. Processing sliding window multi-joins in continuous queries over data streams. In Proc. 29th Int. Conf. on Very Large Data Bases, pages 500-511, 2003.
- B. Babcock, S. Babu, M. Datar, and R. Motwani. Chain: Operator scheduling for memory minimization in data stream systems. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 253-264, 2003.
- D. Carney, U. Cetintemel, A. Rasin, S. Zdonik, M. Cherniack, and M. Stonebraker. Operator scheduling in a data stream manager. In Proc. 29th Int. Conf. on Very Large Data Bases, pages 838-849, 2003.
- N. Tatbul, U. Cetintemel, S. Zdonik, M. Cherniack, and M. Stonebraker. Load shedding in a data stream manager. In Proc. 29th Int. Conf. on Very Large Data Bases, pages 309-320, 2003.
- J-H. Hwang, M. Balazinska, A. Rasin, U. Cetintemel, M. Stonebraker, and S. Zdonik. High-availability algorithms for distributed stream processing. In Proc. 21st Int. Conf. on Data Engineering, pages 779-790, 2005.
MapReduce-based Data Management
- Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google file system. SOSP, pages 29-43, 2003.
- K. Shvachko, H. Kuang, S. Radia, R. Chansler, The Hadoop Distributed File System, IEEE 26th Symposium on Mass Storage Systems and Technologies, 2010.
- Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins: Pig latin: a not-so-foreign language for data processing. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1099-1110, 2008.
- Alan Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan Narayanam, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh Srivastava: Building a High-Level Dataflow System on top of MapReduce: The Pig Experience. Proc. VLDB, 2(2): 1414-1425, 2009.
- Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's Highly Available Key-Value Store. SOSP, 2007.
- Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber: Bigtable: A Distributed Storage System for Structured Data, ACM Trans. Comput. Syst., 26(2): Article 4, 2008.
- Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni: PNUTS: Yahoo!'s hosted data serving platform. Proc. 34th Int. Conf. on Very Large Data Bases, pages 1277-1288, 2008.
- Iman Elghandour, Ashraf Aboulnaga.
ReStore: Reusing Results of MapReduce Jobs. Proc. VLDB, 5(6): 586-597, 2012.