Papers to Read

The following is a list of some papers on the two topics that we will cover in the course. The listed papers include those that I find interesting. If you find other papers that you would like to read and study, please contact me. I have focused mostly (but not exclusively) on conference publications or publications of conference length. This is not because they are more important, but only because they are shorter and may be easier to handle.

Most of the papers listed below can be accessed (and searched) on-line from UW machines.

Papers published in ACM journals and proceedings can be accessed through the ACM Digital Library.
Papers published in IEEE sources can be obtained from IEEE Xplore.
Springer publications (e.g., Lecture Notes in Computer Science - LNCS) can be obtained from Springer LINK.
You may also be interested in exploring the DBLP server, which is searchable and contains many links to on-line papers.

Distributed Storage Systems

S. Ghemawat, H. Gobioff, S-T. Leung. The Google file system. Proc. 19th ACM Symposium on Operating Systems Principles, pages 29-43, 2003.
K. Shvachko, H. Kuang, S. Radia, R. Chansler, The Hadoop Distributed File System, IEEE 26th Symposium on Mass Storage Systems and Technologies, 2010.
Using Lustre with Apache Hadoop
S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proc. 7th USENIX Symp. on Operating System Design and Implementation, pages 307–320, 2006.
Badrish Chandramouli, Guna Prasaad, Donald Kossmann, Justin Levandoski, James Hunter, and Mike Barnett. FASTER: A Concurrent Key-Value Store with In-Place Updates. In Proc. ACM SIGMOD International Conference on Management of Data, pages 275-290, 2018.
A. Maccioni and R. Torlone. Augmented Access for Querying and Exploring a Polystore. In Proc. IEEE 34th International Conference on Data Engineering, pages 77-88, 2018.
Raghu Ramakrishnan, Baskar Sridharan, John R. Douceur, Pavan Kasturi, Balaji Krishnamachari-Sampath, Karthick Krishnamoorthy, Peng Li, Mitica Manu, Spiro Michaylov, Rogério Ramos, Neil Sharman, Zee Xu, Youssef Barakat, Chris Douglas, Richard Draves, Shrikant S. Naidu, Shankar Shastry, Atul Sikaria, Simon Sun, Ramarathnam Venkatesan. Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics. Proc. ACM SIGMOD International Conference on Management of Data, pages 51-63, 2017.
Elena Kakoulli, Herodotos Herodotou. OctopusFS: A Distributed File System with Tiered Storage Management. Proc. ACM SIGMOD International Conference on Management of Data, pages 65-78, 2017.
Niv Dayan, Manos Athanassoulis, Stratos Idreos. Monkey: Optimal Navigable Key-Value Store. Proc. ACM SIGMOD International Conference on Management of Data, pages 79-94, 2017.
N. Bronson, et al. TAO: Facebook's Distributed Data Store for the Social Graph. Proc. USENIX Annual Technical Conference, pages 49-60, 2013.

Background papers

Tran Doan Thanh, Subaji Mohan, Eunmi Choi, SangBum Kim, Pilsung Kim. A Taxonomy and Survey on Distributed File Systems. Proc. Fourth International Conference on Networked Computing and Advanced Information Management, pages 144-149, 2008.
Martin Placek and Rajkumar Buyya. A Taxonomy of Distributed Storage Systems, 2007.

Main memory systems

Cagri Balkesen, Nitin Kunal, Georgios Giannikis, Pit Fender, Seema Sundara, Felix Schmidt, Jarod Wen, Sandeep R. Agrawal, Arun Raghavan, Venkatanathan Varadarajan, Anand Viswanathan, Balakrishnan Chandrasekaran, Sam Idicula, Nipun Agarwal, Eric Sedlar. RAPID: In-Memory Analytical Query Processing Engine with Extreme Performance per Watt. Proc. ACM SIGMOD International Conference on Management of Data, pages 1407-1419, 2018.
Alexander van Renen, Viktor Leis, Alfons Kemper, Thomas Neumann, Takushi Hashida, Kazuichi Oe, Yoshiyasu Doi, Lilian Harada, and Mitsuru Sato. Managing Non-Volatile Memory in Database Systems. In ACM SIGMOD International Conference on Management of Data, pages 1541-1555, 2018.
Viktor Leis, Michael Haubenschild, Alfons Kemper, Thomas Neumann. LeanStore: In-Memory Data Management beyond Main Memory. Proc. 34th IEEE International Conference on Data Engineering, pages 185-196, 2018.
Kayhan Dursun, Carsten Binnig, Ugur Çetintemel, Tim Kraska. Revisiting Reuse in Main Memory Database Systems. Proc. ACM SIGMOD International Conference on Management of Data, pages 1275-1289, 2017.
Ismail Oukid, Daniel Booss, Adrien Lespinasse, Wolfgang Lehner, Thomas Willhalm, Grégoire Gomes. Memory Management Techniques for Large-Scale Persistent-Main-Memory Systems. Proc. VLDB Endow., 10(11): 1166-1177, 2017.
Norman May, Alexander Böhm, Wolfgang Lehner. SAP HANA - The Evolution of an In-Memory DBMS from Pure OLAP Processing Towards Mixed Workloads. Datenbanksysteme für Business, Technologie und Web (BTW 2017), pages 545-563, 2017.
C. Chasseur and J. M. Patel. Design and evaluation of storage organizations for read-optimized main memory databases. Proc. VLDB Endow., 6(13): 1474-1485, 2013.
Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Åke Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, Mike Zwilling. Hekaton: SQL Server's memory-optimized OLTP engine. Proc. ACM SIGMOD International Conference on Management of Data, pages 1243-1254, 2013.
Viktor Leis, Alfons Kemper, Thomas Neumann. The adaptive radix tree: ARTful indexing for main-memory databases. Proc. IEEE 29th International Conference on Data Engineering, pages 38-49, 2013.
Juchang Lee, Yong Sik Kwon, Franz Färber, Michael Muehle, Chulwon Lee, Christian Bensberg, Joo-Yeon Lee, Arthur H. Lee, Wolfgang Lehner. SAP HANA distributed in-memory database system: Transaction, session, and metadata management. Proc. IEEE 29th International Conference on Data Engineering, pages 1165-1173, 2013.
Michael Stonebraker and Ariel Weisberg. The VoltDB Main Memory DBMS. IEEE Data Eng. Bull., 36(2): 21-27, 2013.
Franz Färber, Norman May, Wolfgang Lehner, Philipp Große, Ingo Müller, Hannes Rauhe, Jonathan Dees. The SAP HANA Database -- An Architecture Overview. IEEE Data Eng. Bull., 35(1): 28-33, 2012.
Alfons Kemper, Thomas Neumann. HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. Proc. IEEE 27th International Conference on Data Engineering, pages 195-206, 2011.
Y. Li and J. M. Patel. BitWeaving: fast scans for main memory data processing. Proc. ACM SIGMOD International Conference on Management of Data, pages 289-300, 2013.
S. K. Begley, Z. He, and Y-P. Chen. MCJoin: a memory-constrained join for column-store main-memory databases. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 121-132, 2012.
P. A. Boncz, M. L. Kersten, and S. Manegold. Breaking the memory wall in MonetDB. Commun. ACM, 51(12): 77-85, 2008.
Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alex Rasin, Stanley B. Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, John Hugg, Daniel J. Abadi. H-Store: a high-performance, distributed main memory transaction processing system. Proc. VLDB Endow., 1(2): 1496-1499, 2008.

Background papers

F. Färber, A. Kemper, P.A. Larson, J. Levandoski, T. Neumann, and A. Pavlo. Main Memory Database Systems. Foundations and Trends in Databases, 8(1-2): 1-130, 2017. (You can access this free of charge from UWaterloo computers.)

MapReduce-based Data Management

Maaz Bin Safeer Ahmad, Alvin Cheung. Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications. Proc. ACM SIGMOD International Conference on Management of Data, pages 1205-1220, 2018.
M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica. Apache Spark: a unified engine for big data processing. Commun. ACM 59(11): 56-65, 2016.
M. Armbrust, T. Das, A. Davidson, A. Ghodsi, A. Or, J. Rosen, I. Stoica, P. Wendell, R. Xin, and M. Zaharia. Scaling Spark in the Real World: Performance and Usability. Proc. VLDB Endow., 8(12), 2015.
F. Li, M. T. Özsu, G. Chen, B.C. Ooi. R-Store: A scalable distributed system for supporting real-time analytics. Proc. IEEE 30th International Conference on Data Engineering, pages 40-51, 2014.
C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins. Pig Latin: a not-so-foreign language for data processing. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1099-1110, 2008.
A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, U. Srivastava. Building a High Level Dataflow System on top of MapReduce: The Pig Experience. Proc. VLDB Endow., 2(2): 1414-1425, 2009.
A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow., 2(1): 922-933, 2009.
A. Okcan and M. Riedewald. Anti-combining for MapReduce, Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 839-850, 2014.
L. Chang, Z. Wang, T. Ma, L. Jian, L. Ma, A. Goldshuv, L. Lonergan, J. Cohen, C. Welton, G. Sherry, and M. Bhandarkar. HAWQ: a massively parallel processing SQL engine in Hadoop, Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1223-1234, 2014.
R. Sumbaly, J. Kreps, and S. Shah. The big data ecosystem at LinkedIn. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1125-1134, 2013.
K. Elmeleegy, C. Olston, and B. Reed. SpongeFiles: mitigating data skew in mapreduce using distributed memory. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 551-562, 2014.
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using Hadoop. Proc. IEEE 26th International Conference on Data Engineering, pages 996-1005, 2010.

Background papers

F. Li, B.C. Ooi, M. T. Özsu, and S. Wu. Distributed data management using MapReduce. ACM Comput. Surv., 46(3): Article 31, 2014.
Sherif Sakr, Anna Liu, and Ayman G. Fayoumi. The family of mapreduce and large-scale data processing systems. ACM Comput. Surv., 46(1): Article 11, 2013.

Stream Processing Systems

Jon Gjengset, Malte Schwarzkopf, Jonathan Behrens, Lara Timbó Araújo, Martin Ek, Eddie Kohler, M. Frans Kaashoek, Robert Tappan Morris. Noria: dynamic, partially-stateful data-flow for high-performance web applications. In Proc. 13th USENIX Symposium on Operating Systems Design and Implementation, pages 213-231, 2018.
Jean-François Im, Kishore Gopalakrishna, Subbu Subramaniam, Mayank Shrivastava, Adwait Tumbde, Xiaotian Jiang, Jennifer Dai, Seunghyun Lee, Neha Pawar, Jialiang Li, and Ravi Aringunram. Pinot: Realtime OLAP for 530 Million Users. In Proc. ACM SIGMOD International Conference on Management of Data, pages 583-594, 2018.
Li Wang, Ruichu Cai, Tom Z. J. Fu, Jiong He, Zijie Lu, Marianne Winslett, Zhenjie Zhang. Waterwheel: Realtime Indexing and Temporal Range Query Processing over Massive Data Streams. Proc. 34th IEEE International Conference on Data Engineering, pages 269-280, 2018.
Olga Poppe, Chuan Lei, Salah Ahmed, Elke A. Rundensteiner. Complete Event Trend Detection in High-Rate Event Streams. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 109-124, 2017.
Qun Huang, Patrick P. C. Lee. Toward High-Performance Distributed Stream Processing via Approximate Fault Tolerance. Proc. VLDB Endow., 10(3): 73-84, 2016.
Aneesh Sharma, Jerry Jiang, Praveen Bommannavar, Brian Larson, and Jimmy Lin. GraphJet: real-time content recommendations at twitter. Proc. VLDB Endow., 9(13): 1281-1292, 2016.
Youhuan Li, Lei Zou, Huaming Zhang, Dongyan Zhao. Computing Longest Increasing Subsequences over Sequential Data Streams. Proc. VLDB Endow., 10(3): 181-192, 2016.
Pramod Bhatotia, Umut A. Acar, Flavio P. Junqueira, and Rodrigo Rodrigues. Slider: incremental sliding window analytics. In Proc. 15th International Middleware Conference, pages 61-72, 2014.
Haipeng Dai, Muhammad Shahzad, Alex X. Liu, Yuankun Zhong. Finding Persistent Items in Data Streams. Proc. VLDB Endow., 10(3): 289-300, 2016.
T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. MillWheel: Fault-Tolerant Stream Processing At Internet Scale. Proc. VLDB Endow., 6(11): 1033-1044, 2013.
L. Abraham, J. Allen, O. Barykin, V. Borkar, B. Chopra, C. Gerea, D. Merl, J. Metzler, D. Reiss, S. Subramanian, J. L. Wiener, and O. Zed. Scuba: diving into data at facebook. Proc. VLDB Endow., 6(11): 1057-1067, 2013.
L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed Stream Computing Platform. Proc. 2010 IEEE International Conference on Data Mining Workshops, pages 170-177, 2010.
M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. Proc. 4th USENIX conference on Hot Topics in Cloud Computing, pages 10-10, 2012.
G. Mishne, J. Dalton, Z. Li, A. Sharma, and J. Lin. Fast data in the era of big data: Twitter's real-time related query suggestion architecture. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1147-1158, 2013.
A. Chatzistergiou and S. D. Viglas. Fast Heuristics for Near-Optimal Task Allocation in Data Stream Processing over Clusters. Proc. 23rd ACM International Conference on Information and Knowledge Management, pages 1579-1588, 2014.
L. Golab, T. Johnson, and V. Shkapenyuk. Scalable Scheduling of Updates in Streaming Data Warehouses. IEEE Trans. Knowl. Data Eng., 24(6): 1092-1105, 2012.
L. Golab and T. Johnson. Consistency in a Stream Warehouse. Proc. Biennial Conference on Innovative Data Systems Research, pages 114-122, 2011.
S. Krishnamurthy, M. J. Franklin, J. Davis, D. Farina, P. Golovko, A. Li, and N. Thombre. Continuous analytics over discontinuous streams. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1081-1092, 2010.
L. Golab, T. Johnson, J. S. Seidel, and V. Shkapenyuk. Stream warehousing with DataDepot. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 847-854, 2009.

Background papers

L. Golab and M. T. Özsu. Data Stream Management, Morgan & Claypool, 2010. (You can access this free of charge from UWaterloo computers.)
X. Liu, N. Iftikhar, and X. Xie. Survey of real-time processing systems for big data. Proc. 18th International Database Engineering & Applications Symposium, pages 356-361, 2014.

Graph Analytics

Seongyun Ko, Wook-Shin Han. TurboGraph++: A Scalable and Fast Graph Analytics System. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 395-410, 2018.
Wenfei Fan, Jingbo Xu, Yinghui Wu, Wenyuan Yu, Jiaxin Jiang, Zeyu Zheng, Bohan Zhang, Yang Cao, and Chao Tian. Parallelizing Sequential Graph Computations. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 495-510, 2017.
Shiv Verma, Luke M. Leslie, Yosub Shin, Indranil Gupta. An Experimental Comparison of Partitioning Strategies in Distributed Graph Processing. Proc. VLDB Endow., 10(5): 493-504, 2017.
Miao Qiao, Hao Zhang, Hong Cheng. Subgraph Matching: on Compression and Computation. Proc. VLDB Endow., 11(2): 176-188, 2017.
Christopher R. Aberger, Susan Tu, Kunle Olukotun, and Christopher Ré. EmptyHeaded: A Relational Engine for Graph Processing. In Proc. ACM SIGMOD International Conference on Management of Data, pages 431-446, 2016.
Mohamed S. Hassan, Walid G. Aref, Ahmed M. Aly. Graph Indexing for Shortest-Path Finding over Dynamic Sub-Graphs. Proc. ACM SIGMOD International Conference on Management of Data, pages 1183-1197, 2016.
Hao Wei, Jeffrey Xu Yu, Can Lu, and Xuemin Lin. Speedup Graph Processing by Graph Ordering. In Proc. ACM SIGMOD International Conference on Management of Data, pages 1813-1828, 2016.
J. Mondal and A. Deshpande. Managing large dynamic graphs efficiently. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 145-156, 2012.
L. Qin, J. X. Yu, L. Chang, H. Cheng, C. Zhang, and X. Lin. Scalable big graph processing in MapReduce. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 827-838, 2014.
Z. Wang, Q. Fan, H. Wang, K-L. Tan, D. Agrawal, A. El Abbadi. Pagrol: Parallel graph olap over large-scale attributed graphs. Proc. IEEE 30th International Conference on Data Engineering, pages 496-507, 2014.
V. Satuluri, S. Parthasarathy, and Y. Ruan. Local graph sparsification for scalable clustering. Proc. ACM SIGMOD International Conference on Management of Data, pages 721-732, 2011.
J. Kim, W-S. Han, S. Lee, K. Park, H. Yu. OPT: a new framework for overlapped and parallel triangulation in large-scale graphs, Proc. ACM SIGMOD International Conference on Management of Data, pages 637-648, 2014.
W. Cui, Y. Xiao, H. Wang, W. Wang. Local search of communities in large graphs, Proc. ACM SIGMOD International Conference on Management of Data, pages 991-1002, 2014.
N. Satish, N. Sundaram, M.A. Patwary, J. Seo, J. Park, M. A. Hassaan, S. Sengupta, Z. Yin, P. Dubey. Navigating the maze of graph analytics frameworks using massive graph datasets, Proc. ACM SIGMOD International Conference on Management of Data, pages 979-990, 2014.
A. D. Zhu, W. Lin, S. Wang, X. Xiao. Reachability queries on large dynamic graphs: a total order approach, Proc. ACM SIGMOD International Conference on Management of Data, pages 1323-1334, 2014.
B. Shao, H. Wang, Y. Li. Trinity: a distributed graph engine on a memory cloud, Proc. ACM SIGMOD International Conference on Management of Data, pages 505-516, 2013.
L. Wang, Y. Xiao, B. Shao, H. Wang. How to partition a billion-node graph. Proc. IEEE 30th International Conference on Data Engineering, pages 568-579, 2014.

Background papers

D. Yan, Y. Bu, Y. Tian, and A. Deshpande. Big Graph Analytics Platforms. Foundations and Trends in Databases, 7(1-2): 1-195, 2017.
Safiollah Heidari, Yogesh Simmhan, Rodrigo N. Calheiros, and Rajkumar Buyya. Scalable Graph Processing Frameworks: A Taxonomy and Open Challenges. ACM Computing Surveys, 51(3): Article 60, 2018.
Omar Batarfi, Radwa El Shawi, Ayman G. Fayoumi, Reza Nouri, Ahmed Barnawi, and Sherif Sakr. Large scale graph processing systems: survey and an experimental evaluation. Cluster Computing, 18(3): 1189-1213, 2015.
Renzo Angles and Claudio Gutierrez. Survey of graph database models. ACM Comput. Surv., 40(1): Article 1, 2008.

Graph Database Systems

Arijit Khan, Gustavo Segovia, Donald Kossmann. On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage. In Proc. USENIX Annual Technical Conference, pages 401-412, 2018.
Kyoungmin Kim, In Seo, Wook-Shin Han, Jeong-Hoon Lee, Sungpack Hong, Hassan Chafi, Hyungyu Shin, Geonhwa Jeong. TurboFlux: A Fast Continuous Subgraph Matching System for Streaming Graph Data. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 411-426, 2018.
Zijian Li, Xun Jian, Xiang Lian, Lei Chen. An Efficient Probabilistic Approach for Graph Similarity Search. Proc. 34th IEEE International Conference on Data Engineering, pages 533-544, 2018.
Anurag Khandelwal, Zongheng Yang, Evan Ye, Rachit Agarwal, Ion Stoica. ZipG: A Memory-efficient Graph Store for Interactive Queries. Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1149-1164, 2017.
Ayush Dubey, Greg D. Hill, Robert Escriva, and Emin Gün Sirer. Weaver: a high-performance, transactional graph database based on refinable timestamps. Proc. VLDB Endow. 9(11): 852-863, 2016.
J. Lee, W.S. Han, R. Kasperovics, and J.H. Lee. An in-depth comparison of subgraph isomorphism algorithms in graph databases. Proc. VLDB Endowment, 6(2): 133-144, 2013.
W. Fan, X. Wang, Y. Wu. Querying big graphs within bounded resources, Proc. ACM SIGMOD International Conference on Management of Data, pages 301-312, 2014.
Y. Shao, L. Chen, and B. Cui. Efficient cohesive subgraphs detection in parallel, Proc. ACM SIGMOD International Conference on Management of Data, pages 613-624, 2014.
Norbert Martínez-Bazan, M. Ángel Águila-Lorente, Victor Muntés-Mulero, David Dominguez-Sal, Sergio Gómez-Villamor, and Josep-L. Larriba-Pey. Efficient graph management based on bitmap indices. Proc. 16th International Database Engineering & Applications Symposium, pages 110-119, 2012.
Hilmi Yildirim, Vineet Chaoji, and Mohammed J. Zaki. GRAIL: scalable reachability index for large graphs. Proc. VLDB Endow., 3(1-2): 276-284, 2010.
H. He and A.K. Singh. Graphs-at-a-time: query language and access methods for graph databases. In Proc. ACM SIGMOD International Conference on Management of Data, pages 405-418, 2008.

Background papers

Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan Reutter, and Domagoj Vrgoč. Foundations of Modern Query Languages for Graph Databases. ACM Comput. Surv., 50(5): Article 68, 2017.
Renzo Angles and Claudio Gutierrez. Survey of graph database models. ACM Comput. Surv., 40(1): Article 1, 2008.

Machine Learning for Big Data Analytics

W. Wang, J. Gao, M. Zhang, G. Chen, T.K. Ng, B.C. Ooi, J. Shao, M. Reyad. Rafiki: Machine Learning as an Analytics Service System. Proc. VLDB Endow., 12(2): 128-140, 2018.
Jiaxuan You, Rex Ying, Xiang Ren, William L. Hamilton, Jure Leskovec. GraphRNN: Generating realistic graphs with deep auto-regressive models. Proc. International Conference on Machine Learning, pages 5694-5703, 2018.
Y. Park, J. Qing, X. Shen and B. Mozafari. BlinkML: Approximate Machine Learning with Probabilistic Guarantees. Proc. VLDB Endow., 12, 2018. To appear.
T. Kraska, A. Beutel, E. H. Chi, J. Dean, N. Polyzotis. The Case for Learned Index Structures. Proc. ACM SIGMOD International Conference on Management of Data, pages 489-504, 2018.
Teng Li, Zhiyuan Xu, Jian Tang, Yanzhi Wang. Model-free Control for Distributed Stream Data Processing using Deep Reinforcement Learning. Proc. VLDB Endow., 11(6): 705-718, 2018.
Matthias Boehm, Berthold Reinwald, Dylan Hutchison, Prithviraj Sen, Alexandre V. Evfimievski, Niketan Pansare. On Optimizing Operator Fusion Plans for Large-Scale Machine Learning in SystemML. Proc. VLDB Endow., 11(12): 1755-1768, 2018.
Hancheng Ge, Kai Zhang, Majid Alfifi, Xia Hu, James Caverlee. DisTenC: A Distributed Algorithm for Scalable Tensor Completion on Spark. Proc. 34th IEEE International Conference on Data Engineering. pages 137-148, 2018.
P. Bailis, E. Gan, S. Madden, D. Narayanan, R. Rong, S. Suri. MacroBase: Prioritizing attention in fast data. Proc. ACM SIGMOD International Conference on Management of Data, pages 541-556, 2017.
Fan Yang, Fanhua Shang, Yuzhen Huang, James Cheng, Jinfeng Li, Yunjian Zhao, Ruihao Zhao. LFTF: A Framework for Efficient Tensor Analytics at Scale. Proc. VLDB Endow., 10(7): 745-756, 2017.
William Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (Proc. 31st Conference on Neural Information Processing Systems), pages 1024-1034, 2017.
T. Kraska, A. Talwalkar, J. Duchi, R. Griffith, M.J. Franklin, and M. Jordan. MLBase: A distributed machine learning system, Proc. 6th Biennial Conference on Innovative Data Systems Research, 2013.
M. Boehm, M. W. Dusenberry, D. Eriksson, A. V. Evfimievski, F. M. Manshadi, N. Pansare, B. Reinwald, F. R. Reiss, P. Sen, A. C. Surve, S. Tatikonda. SystemML: Declarative Machine Learning on Spark. Proc. VLDB Endow., 9(13): 1425-1436, 2016.
X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a unified architecture for in-RDBMS analytics. In Proc. ACM SIGMOD International Conference on Management of Data, pages 325-336, 2012.
Yongjoo Park, Ahmad Shahab Tajik, Michael J. Cafarella, Barzan Mozafari. Database Learning: Toward a Database that Becomes Smarter Every Time. Proc. ACM SIGMOD International Conference on Management of Data, pages 587-602, 2017.
Xiaogang Shi, Bin Cui, Yingxia Shao, and Yunhai Tong. Tornado: A System For Real-Time Iterative Analysis Over Evolving Data. In Proc. ACM SIGMOD International Conference on Management of Data, pages 417-430, 2016.
Shivaram Venkataraman, Zongheng Yang, Davies Liu, Eric Liang, Hossein Falaki, Xiangrui Meng, Reynold Xin, Ali Ghodsi, Michael Franklin, Ion Stoica, and Matei Zaharia. SparkR: Scaling R Programs with Spark. In Proc. ACM SIGMOD International Conference on Management of Data, pages 1099-1104, 2016.
B. Chandramouli, R.C. Fernandez, J. Goldstein, A. Eldawy, A. Quamar. Quill: Efficient, Transferable, and Rich Analytics at Scale. Proc. VLDB Endow., 9(14): 1623-1634, 2016.

Background papers

Stephan Günnemann. Machine Learning Meets Databases. Datenbank-Spektrum, 17(1): 77-83, 2017.
Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. Data Management Challenges in Production Machine Learning. Proc. ACM SIGMOD International Conference on Management of Data, pages 1723-1726, 2017. (Tutorial description)
A. Kumar, M. Boehm, and J. Yang. Data Management in Machine Learning: Challenges, Techniques, and Systems. Proc. ACM SIGMOD International Conference on Management of Data, pages 1717-1722, 2017. (Tutorial description)

RDF Data Management

Gensheng Zhang, Damian Jimenez, Chengkai Li. Maverick: Discovering Exceptional Facts from Knowledge Graphs. Proc. ACM SIGMOD International Conference on Management of Data, pages 1317-1332, 2018.
Efrat Abramovitz, Daniel Deutch, Amir Gilad. Interactive Inference of SPARQL Queries Using Provenance. Proc. 34th IEEE International Conference on Data Engineering, pages 581-592, 2018.
Liang He, Bin Shao, Yatao Li, Huanhuan Xia, Yanghua Xiao, Enhong Chen, Liang Chen. Stylus: A Strongly-Typed Store for Serving Massive RDF Data. Proc. VLDB Endow., 11(2): 203-216, 2017.
Ibrahim Abdelaziz, Essam Mansour, Mourad Ouzzani, Ashraf Aboulnaga, Panos Kalnis. Lusail: A System for Querying Linked Data at Scale. Proc. VLDB Endow., 11(4): 485-498, 2017.
Weiguo Zheng, Jeffrey Xu Yu, Lei Zou, Hong Cheng. Question Answering Over Knowledge Graphs: Question Understanding Via Template Decomposition. Proc. VLDB Endow., 11(11): 1373-1386, 2018.
S. Gurajada, S. Seufert, I. Miliaraki, and M. Theobald. TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing. Proc. ACM SIGMOD International Conference on Management of Data, pages 289-300, 2014.
L. Zou, M. T. Özsu, L. Chen, X. Shen, R. Huang, and D. Zhao. gStore: a graph-based SPARQL query engine. The VLDB Journal, 23(4): 565-590, 2014.
P. Yuan, P. Liu, B. Wu, H. Jin, W. Zhang, and L. Liu. TripleBit: A fast and compact system for large scale RDF data. Proc. VLDB Endow., 6(7): 517-528, 2013.
F. Goasdoué, Z. Kaoudi, I. Manolescu, J. Quiané-Ruiz and S. Zampetakis. CliqueSquare: Flat Plans for Massively Parallel RDF Queries. Proc. 31st IEEE International Conference on Data Engineering, 2015.
K. Zeng, J. Yang, H. Wang, B. Shao, and Z. Wang. A distributed graph engine for web scale RDF data. Proc. VLDB Endowment, 6(4): 265-276, 2013.
M. A. Bornea, J. Dolby, A. Kementsietsidis, K. Srinivas, P. Dantressangle, O. Udrea, B. Bhattacharjee. Building an efficient RDF store over a relational database, Proc. ACM SIGMOD International Conference on Management of Data, pages 121-132, 2013.
T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. Proc. VLDB Endow., 1(1): 647-659, 2008.
K. Wilkinson, C. Sayers, H. A. Kuno, and D. Reynolds. Efficient RDF storage and retrieval in Jena2. Proc. 1st Int. Workshop on Semantic Web and Databases, pages 131-150, 2003.
C. Weiss, P. Karras, and A. Bernstein. Hexastore: sextuple indexing for semantic web data management. Proc. VLDB Endow., 1(1): 1008-1019, 2008.
J. Broekstra, A. Kampman, and F. van Harmelen. Sesame: A generic architecture for storing and querying RDF and RDF schema. Proc. 1st International Semantic Web Conference, pages 54-68, 2002.
A. Harth, J. Umbrich, A. Hogan, and S. Decker. YARS2: A federated repository for querying graph structured data from the web. Proc. 6th International Semantic Web Conference, pages 211-224, 2007.
K. Lee and L. Liu. Scaling queries over big RDF graphs with semantic hash partitioning. Proc. VLDB Endow., 6(14): 1894-1905, 2013.
D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable semantic web data management using vertical partitioning. Proc. 33rd International Conference on Very Large Data Bases, pages 411-422, 2007.
F. Prasser, A. Kemper, and K. A. Kuhn. Efficient distributed query processing for autonomous RDF databases. Proc. 15th International Conference on Extending Database Technology, pages 372-383, 2012.
J. Huang, D. J. Abadi, K. Ren. Scalable SPARQL Querying of Large RDF Graphs. Proc. VLDB Endow., 4(11): 1123-1134, 2011.
Y. Yan, C. Wang, A. Zhou, W. Qian, L. Ma, and Y. Pan. Efficient indices using graph partitioning in RDF triple stores. Proc. 25th International Conference on Data Engineering, pages 1263-1266, 2009.
T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. Proc. ACM SIGMOD International Conference on Management of Data, pages 627-640, 2009.
P. Cudré-Mauroux, I. Enchev, S. Fundatureanu, P. Groth, A. Haque, A. Harth, F. L. Keppmann, D. Miranker, J. F. Sequeda, M. Wylot. NoSQL Databases for RDF: An Empirical Evaluation. Proc. International Semantic Web Conference, pages 310-325, 2013.

Background papers

M. T. Özsu. A Survey of RDF Data Management Systems. Front. Comp. Sci., 10(3): 418-432, 2016.

L. Zou and M. T. Özsu. Graph-based RDF Data Management. Data Science and Engineering, 2:56-70, 2017.

K. Hose, R. Schenkel, M. Theobald, G. Weikum. Database Foundations for Scalable RDF Processing. In Reasoning Web. Semantic Technologies for the Web of Data, A. Polleres, C. d'Amato, M. Arenas, S. Handschuh, P. Kroner, S. Ossowski, and P. Patel-Schneider (eds), LNCS Volume 6848, pages 202-249, Springer, 2011.

P. Boncz, O. Erling, M-D. Pham. Advances in Large-Scale RDF Data Management. In Linked Open Data -- Creating Knowledge Out of Interlinked Data, S. Auer, V. Bryl, and S. Tramp (eds), LNCS Volume 8661, pages 21-44, Springer, 2014.

G. Aluc, M. T. Özsu, K. Daudjee. Workload Matters: Why RDF Databases Need a New Design. Proc. VLDB Endow., 7(10): 837-840, 2014.

Z. Kaoudi and I. Manolescu. RDF in the Clouds: A Survey. The VLDB Journal, 2014.