Data Cleaning Books and Surveys

  1. Ihab F. Ilyas and Xu Chu, Data Cleaning, ACM Books
  2. Erhard Rahm, Hong Hai Do, Data Cleaning: Problems and Current Approaches, IEEE Data Eng. Bull. 23(4): 3-13 (2000)
  3. Tamraparni Dasu, Theodore Johnson, Exploratory Data Mining and Data Cleaning, John Wiley 2003, ISBN 0-471-26851-8
  4. Ihab F. Ilyas, Xu Chu, Trends in Cleaning Relational Data, Foundations and Trends in Databases, Vol. 5, No. 4, pp. 281-393

1. Error Detection

1.1 Constraints Language

  1. Philip Bohannon, Wenfei Fan, Floris Geerts, Xibei Jia, Anastasios Kementsietsidis, Conditional Functional Dependencies for Data Cleaning, ICDE 2007
  2. L. Bravo, W. Fan, and S. Ma, Extending dependencies with conditions, VLDB 2007
  3. Wenfei Fan, Shuai Ma, Yanli Hu, Jie Liu, Yinghui Wu, Propagating Functional Dependencies with Conditions, PVLDB 1(1), 2008
  4. Jiannan Wang, Nan Tang, Towards dependable data repairing with fixing rules, SIGMOD 2014

1.2 Causality and Error Propagation/Explanation

  1. Alexandra Meliou, Wolfgang Gatterbauer, Suman Nath, Dan Suciu, Tracing data errors with view-conditioned causality, SIGMOD 2011
  2. Eugene Wu, Samuel Madden, Scorpion: Explaining Away Outliers in Aggregate Queries, PVLDB 6(8), 2013
  3. Anup Chalamalla, Ihab F. Ilyas, Mourad Ouzzani, and Paolo Papotti, Descriptive and Prescriptive Data Cleaning, SIGMOD 2014
  4. Xiaolan Wang, Xin Luna Dong, and Alexandra Meliou, Data X-Ray: A Diagnostic Tool for Data Errors, SIGMOD 2015
  5. Xiaolan Wang, Alexandra Meliou, and Eugene Wu, QFix: Diagnosing errors through query histories, SIGMOD 2017

1.3 Solutions

  1. Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, and Nan Tang, Detecting Data Errors: Where are we and what needs to be done?, PVLDB 9(12), 2016
  2. Pei Wang, Yeye He, Uni-Detect: A Unified Approach to Automated Error Detection in Tables, SIGMOD 2019
  3. Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro Fernandez, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang, Raha: A Configuration-Free Error Detection System, SIGMOD 2019
  4. Alireza Heidari, Joshua McGrath, Ihab F. Ilyas, Theodoros Rekatsinas, HoloDetect: Few-Shot Learning for Error Detection, SIGMOD 2019

2. Constraints Discovery

2.1 FD and CFD

  1. Yka Huhtala, Juha Kärkkäinen, Pasi Porkka, Hannu Toivonen, TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies, The Computer Journal, Vol. 42, No. 2, 1999
  2. I. F. Ilyas, V. Markl, P. J. Haas, P. Brown, and A. Aboulnaga, CORDS: Automatic discovery of correlations and soft functional dependencies, SIGMOD 2004
  3. L. Golab, H. J. Karloff, F. Korn, D. Srivastava, and B. Yu, On generating near-optimal tableaux for conditional functional dependencies, PVLDB 1(1), 2008
  4. F. Chiang and R. J. Miller, Discovering data quality rules, PVLDB 1(1), 2008
  5. W. Fan, F. Geerts, L. V. Lakshmanan, and M. Xiong, Discovering conditional functional dependencies, ICDE 2009
  6. Thorsten Papenbrock, Felix Naumann, A Hybrid Approach to Functional Dependency Discovery, SIGMOD 2016

2.2 Keys

  1. Arvid Heise, Jorge-Arnulfo Quiané-Ruiz, Ziawasch Abedjan, Anja Jentzsch, Felix Naumann, Scalable Discovery of Unique Column Combinations, PVLDB 7(4), 2014

2.3 Denial Constraints

  1. Xu Chu, Ihab F. Ilyas, Paolo Papotti, Discovering Denial Constraints, PVLDB 6(13), 2013
  2. Tobias Bleifuß, Sebastian Kruse, Felix Naumann, Efficient Denial Constraint Discovery with Hydra, PVLDB 11(3): 311-323 (2017)

3. Data Repairing

3.1 Record Linkage

  1. Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios, Duplicate Record Detection: A Survey, IEEE Trans. Knowl. Data Eng. (2007)
  2. Sunita Sarawagi, Anuradha Bhamidipaty, Interactive deduplication using active learning, KDD 2002
  3. W. Fan, J. Li, S. Ma, N. Tang, and W. Yu, Interaction between record matching and data repairing, SIGMOD 2011
  4. Xu Chu, Ihab F. Ilyas, Paraschos Koutris, Distributed Data Deduplication, PVLDB 9(11), 2016
  5. Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, Vijay Raghavendra, Deep Learning for Entity Matching: A Design Space Exploration, SIGMOD 2018

3.2 Data Repairing of Constraints Violations

  1. P. Bohannon, W. Fan, M. Flaster, and R. Rastogi, A cost-based model and effective heuristic for repairing constraints by value modification, SIGMOD 2005
  2. A. Lopatenko and L. Bravo, Efficient approximation algorithms for repairing inconsistent databases, ICDE 2007
  3. G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma, Improving data quality: Consistency and accuracy, VLDB 2007
  4. Solmaz Kolahi, Laks V. S. Lakshmanan, On Approximating Optimum Repairs for Functional Dependency Violations, ICDT 2009
  5. Fei Chiang, Renee J. Miller, A Unified Model for Data and Constraint Repair, ICDE 2011
  6. Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang, Wenyuan Yu, Towards certain fixes with editing rules and master data, PVLDB 3(1), 2010
  7. Xu Chu, Ihab F. Ilyas, Paolo Papotti, Holistic data cleaning: Putting violations into context, ICDE 2013
  8. George Beskales, Ihab F. Ilyas, Lukasz Golab, Artur Galiullin, On the relative trust between inconsistent data and inaccurate constraints, ICDE 2013
  9. Maksims Volkovs, Fei Chiang, Jaroslaw Szlichta, Renée J. Miller, Continuous data cleaning, ICDE 2014
  10. Nataliya Prokoshyna, Jaroslaw Szlichta, Fei Chiang, Renée J. Miller, Divesh Srivastava, Combining Quantitative and Logical Data Cleaning, PVLDB 9(4), 2015

3.3 ML-based models for Data Repairing

  1. Michael Stonebraker, Daniel Bruckner, Ihab F. Ilyas, George Beskales, Mitch Cherniack, Stanley B. Zdonik, Alexander Pagan, Shan Xu, Data Curation at Scale: The Data Tamer System, CIDR 2013
  2. Theodoros Rekatsinas, Manas Joglekar, Hector Garcia-Molina, Aditya G. Parameswaran, Christopher Ré, SLiMFast: Guaranteed Results for Data Fusion and Source Reliability, SIGMOD 2017
  3. Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, Christopher Ré, HoloClean: Holistic Data Repairs with Probabilistic Inference, PVLDB 10(11): 1190-1201 (2017)
  4. Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, Eugene Wu, BoostClean: Automated Error Detection and Repair for Machine Learning, arXiv
  5. Mohamed Yakout, Laure Berti-Equille, Ahmed K. Elmagarmid, Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes, SIGMOD 2013

3.4 Probabilistic and Model-based Data Repairing

  1. George Beskales, Mohamed A. Soliman, Ihab F. Ilyas, Shai Ben-David, Modeling and Querying Possible Repairs in Duplicate Detection, PVLDB 2(1), 2009
  2. George Beskales, Ihab F. Ilyas, Lukasz Golab, Sampling the Repairs of Functional Dependency Violations under Hard Constraints, PVLDB 3(1), 2010
  3. Jiannan Wang, Sanjay Krishnan, Michael Franklin, Ken Goldberg, Tim Kraska, Tova Milo, A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data, SIGMOD 2014
  4. Sanjay Krishnan, Jiannan Wang, Michael Franklin, Ken Goldberg, Tim Kraska, Stale View Cleaning: Getting Fresh Answers From Stale Materialized Views, PVLDB 8(12), 2015
  5. Sanjay Krishnan, Jiannan Wang, Michael J. Franklin, Ken Goldberg, Eugene Wu, ActiveClean: Interactive Data Cleaning For Statistical Modeling, PVLDB 9(12), 2016
  6. Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Ré and Theodoros Rekatsinas, A Formal Framework For Probabilistic Unclean Databases (arXiv), ICDT 2019

3.5 User-Centric Data Repairing

  1. Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Neville, Mourad Ouzzani, Ihab F. Ilyas, Guided data repair, PVLDB 4(5), 2011
  2. Jiannan Wang, Tim Kraska, Michael J. Franklin, Jianhua Feng, CrowdER: Crowdsourcing Entity Resolution, PVLDB 5(11), 2012

4. Discovery and Knowledge Fusion

  1. Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang, Knowledge Vault: A Web-scale approach to probabilistic knowledge fusion, SIGKDD 2014
  2. Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Kevin Murphy, Shaohua Sun, and Wei Zhang, From data fusion to knowledge fusion, PVLDB 7(10), 2014
  3. Xin Luna Dong, Evgeniy Gabrilovich, Kevin Murphy, Van Dang, Wilko Horn, Camillo Lugaresi, Shaohua Sun, and Wei Zhang, Knowledge-based trust: estimating the trustworthiness of web sources, PVLDB 8(9), 2015
  4. Raul Castro Fernandez, Essam Mansour, Abdulhakim Qahtan, Ahmed Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang, Seeping Semantics: Linking Datasets using Word Embeddings for Data Discovery, ICDE 2018

5. Data Cleaning Systems

  1. Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, AJAX: An Extensible Data Cleaning Tool, SIGMOD 2000
  2. Vijayshankar Raman, Joseph M. Hellerstein, Potter's Wheel: An Interactive Data Cleaning System, VLDB 2001
  3. Amr Ebaid, Ahmed K. Elmagarmid, Ihab F. Ilyas, Mourad Ouzzani, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin, NADEEF: A Generalized Data Cleaning System, SIGMOD 2013
  4. Floris Geerts, Giansalvatore Mecca, Paolo Papotti, Donatello Santoro, The LLUNATIC Data-Cleaning Framework, PVLDB 6(9), 2013
  5. Michael Stonebraker, Daniel Bruckner, Ihab F. Ilyas, George Beskales, Mitch Cherniack, Stanley B. Zdonik, Alexander Pagan, Shan Xu, Data Curation at Scale: The Data Tamer System, CIDR 2013
  6. Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, Yin Ye, KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing, SIGMOD 2015
  7. Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge Quiané-Ruiz, Nan Tang, Si Yin, BigDansing: A System for Big Data Cleansing, SIGMOD 2015