Data Cleaning Books and Surveys
- Ihab F. Ilyas and Xu Chu, Data Cleaning, ACM Books
- Erhard Rahm, Hong Hai Do, Data
Cleaning: Problems and Current Approaches, IEEE
Data Eng. Bull. 23(4): 3-13 (2000)
- Tamraparni Dasu, Theodore Johnson, Exploratory
Data Mining and Data Cleaning, John Wiley 2003,
ISBN 0-471-26851-8
- Ihab F. Ilyas, Xu Chu, Trends
in Cleaning Relational Data, Foundations and
Trends in Databases: Vol. 5: No. 4, pp 281-393
1. Error Detection
1.1 Constraints Language
- Philip Bohannon, Wenfei Fan, Floris Geerts, Xibei
Jia, Anastasios Kementsietsidis, Conditional
Functional Dependencies for Data Cleaning, ICDE
2007
- L. Bravo, W. Fan, and S. Ma, Extending dependencies with
conditions, VLDB 2007
- Wenfei Fan, Shuai Ma, Yanli Hu, Jie Liu, Yinghui
Wu, Propagating Functional Dependencies
with Conditions, PVLDB 1(1), 2008
- Jiannan Wang, Nan Tang, Towards dependable data repairing
with fixing rules, SIGMOD 2014
1.2 Causality and Error Propagation/Explanation
- Alexandra Meliou, Wolfgang Gatterbauer, Suman Nath,
Dan Suciu, Tracing
data errors with view-conditioned causality,
SIGMOD 2011
- Eugene Wu, Samuel Madden, Scorpion:
Explaining Away Outliers in Aggregate Queries,
PVLDB 6(8), 2013
- Anup Chalamalla, Ihab F. Ilyas, Mourad Ouzzani, and
Paolo Papotti, Descriptive
and Prescriptive Data Cleaning, SIGMOD 2014
- Xiaolan Wang, Xin Luna Dong, and Alexandra Meliou,
Data X-Ray: A Diagnostic Tool for Data Errors,
SIGMOD 2015
- Xiaolan Wang, Alexandra Meliou, and Eugene Wu,
QFix: Diagnosing errors through query histories,
SIGMOD 2017
1.3 Solutions
- Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro
Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo
Papotti, Michael Stonebraker, and Nan Tang, Detecting
Data Errors: Where are we and what needs to be done?,
PVLDB 9(12), 2016
- Pei Wang, Yeye He, Uni-Detect:
A Unified Approach to Automated Error Detection in
Tables, SIGMOD 2019
- Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro
Fernandez, Samuel Madden, Mourad Ouzzani, Michael
Stonebraker, Nan Tang, Raha:
A Configuration-Free Error Detection System,
SIGMOD 2019
- Alireza Heidari, Joshua McGrath, Ihab F. Ilyas,
Theodoros Rekatsinas, HoloDetect:
Few-Shot Learning for Error Detection, SIGMOD
2019
2. Constraints Discovery
2.1 FD and CFD
- Yka Huhtala, Juha Karkkainen, Pasi Porkka, Hannu
Toivonen, TANE: An Efficient Algorithm for
Discovering Functional and Approximate Dependencies,
The Computer Journal, Vol. 42, No. 2, 1999
- I. F. Ilyas, V. Markl, P. J. Haas, P. Brown, and A.
Aboulnaga, CORDS: Automatic discovery of
correlations and soft functional dependencies,
SIGMOD 2004
- L. Golab, H. J. Karloff, F. Korn, D. Srivastava,
and B. Yu, On generating near-optimal tableaux
for conditional functional dependencies, PVLDB
1(1), 2008
- F. Chiang and R. J. Miller, Discovering data quality rules,
PVLDB 1(1), 2008
- W. Fan, F. Geerts, L. V. Lakshmanan, and M. Xiong,
Discovering conditional functional
dependencies, ICDE 2009
- Thorsten Papenbrock, Felix Naumann, A
Hybrid Approach to Functional Dependency Discovery,
SIGMOD 2016
2.2 Keys
- Arvid Heise, Jorge-Arnulfo Quiané-Ruiz, Ziawasch
Abedjan, Anja Jentzsch, Felix Naumann, Scalable
Discovery of Unique Column Combinations, PVLDB
7(4), 2014
2.3 Denial Constraints
- Xu Chu, Ihab F. Ilyas, Paolo Papotti, Discovering Denial Constraints,
PVLDB 6(13), 2013
- Tobias Bleifuß, Sebastian Kruse, Felix Naumann,
Efficient Denial Constraint Discovery with Hydra.
PVLDB 11(3): 311-323 (2017)
3. Data Repairing
3.1 Record Linkage
- Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis,
Vassilios S. Verykios, Duplicate Record Detection: A Survey,
IEEE Trans. Knowl. Data Eng. (2007)
- Sunita Sarawagi, Anuradha Bhamidipaty, Interactive deduplication using
active learning, KDD 2002
- W. Fan, J. Li, S. Ma, N. Tang, and W. Yu, Interaction between record matching
and data repairing, SIGMOD 2011
- Xu Chu, Ihab F. Ilyas, Paraschos Koutris, Distributed Data Deduplication,
PVLDB 9(11), 2016
- Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai
Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep,
Esteban Arcaute, Vijay Raghavendra, Deep
Learning for Entity Matching: A Design Space
Exploration. SIGMOD 2018
3.2 Data Repairing of Constraints Violations
- P. Bohannon, W. Fan, M. Flaster, and R. Rastogi, A cost-based model and effective
heuristic for repairing constraints by value
modification, SIGMOD 2005
- A. Lopatenko and L. Bravo, Efficient approximation algorithms
for repairing inconsistent databases, ICDE 2007
- G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma, Improving data quality: Consistency
and accuracy, VLDB 2007
- Solmaz Kolahi, Laks V. S. Lakshmanan, On Approximating Optimum Repairs
for Functional Dependency Violations, ICDT 2009
- Fei Chiang, Renée J. Miller, A Unified Model for Data and
Constraint Repair, ICDE 2011
- Wenfei Fan, Jianzhong Li, Shuai Ma, Nan Tang,
Wenyuan Yu, Towards certain fixes with editing
rules and master data, PVLDB 3(1), 2010
- Xu Chu, Ihab F. Ilyas, Paolo Papotti, Holistic data cleaning: Putting
violations into context, ICDE 2013
- George Beskales, Ihab F. Ilyas, Lukasz Golab, Artur
Galiullin, On the relative trust between
inconsistent data and inaccurate constraints,
ICDE 2013
- Maksims Volkovs, Fei Chiang, Jaroslaw Szlichta,
Renée J. Miller, Continuous data cleaning, ICDE
2014
- Nataliya Prokoshyna, Jaroslaw Szlichta, Fei Chiang,
Renée J. Miller, Divesh Srivastava, Combining
Quantitative and Logical Data Cleaning. PVLDB
9(4), 2015
3.3 ML-based models for Data Repairing
- Michael Stonebraker, Daniel Bruckner, Ihab F. Ilyas,
George Beskales, Mitch Cherniack, Stanley B. Zdonik,
Alexander Pagan, Shan Xu, Data Curation at Scale: The Data
Tamer System, CIDR 2013
- Theodoros Rekatsinas, Manas Joglekar, Hector
Garcia-Molina, Aditya G. Parameswaran, Christopher Ré,
SLiMFast:
Guaranteed Results for Data Fusion and Source
Reliability. SIGMOD 2017
- Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas,
Christopher Ré, HoloClean:
Holistic Data Repairs with Probabilistic Inference.
PVLDB 10(11): 1190-1201 (2017)
- Sanjay Krishnan, Michael J. Franklin, Ken Goldberg,
Eugene Wu, BoostClean:
Automated Error Detection and Repair for Machine
Learning, arXiv
- Mohamed Yakout, Laure Berti-Equille, Ahmed K.
Elmagarmid, Don't be SCAREd: use SCalable
Automatic REpairing with maximal likelihood and
bounded changes, SIGMOD 2013
3.4 Probabilistic and Model-based Data Repairing
- George Beskales, Mohamed A. Soliman, Ihab F. Ilyas,
Shai Ben-David, Modeling and Querying Possible
Repairs in Duplicate Detection, PVLDB 2(1), 2009
- George Beskales, Ihab F. Ilyas, Lukasz Golab, Sampling the Repairs of Functional
Dependency Violations under Hard Constraints,
PVLDB 3(1), 2010
- Jiannan Wang, Sanjay Krishnan, Michael Franklin, Ken
Goldberg, Tim Kraska, Tova Milo.
A Sample-and-Clean Framework for Fast and Accurate
Query Processing on Dirty Data. SIGMOD 2014
- Sanjay Krishnan, Jiannan Wang, Michael Franklin, Ken
Goldberg, Tim Kraska, Stale
View Cleaning: Getting Fresh Answers From Stale
Materialized Views. PVLDB 8(12), 2015.
- Sanjay Krishnan, Jiannan Wang, Michael J. Franklin,
Ken Goldberg, Eugene Wu.
ActiveClean: Interactive Data Cleaning For
Statistical Modeling. PVLDB 9(12), 2016
- Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld,
Christopher Ré and Theodoros Rekatsinas,
A Formal Framework For Probabilistic Unclean
Databases (arXiv),
ICDT 2019
3.5 User-Centric Data Repairing
- Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer
Neville, Mourad Ouzzani, Ihab F. Ilyas, Guided data repair, PVLDB 4(5),
2011
- Jiannan Wang, Tim Kraska, Michael J. Franklin,
Jianhua Feng, CrowdER: Crowdsourcing Entity
Resolution, PVLDB 5(11), 2012
4. Discovery and Knowledge Fusion
- Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz,
Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann,
Shaohua Sun, and Wei Zhang. Knowledge
Vault: A Web-scale approach to probabilistic
knowledge fusion, SIGKDD 2014
- Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz,
Wilko Horn, Kevin Murphy, Shaohua Sun, and Wei
Zhang, From
data fusion to knowledge fusion, PVLDB 7(10),
2014
- Xin Luna Dong, Evgeniy Gabrilovich, Kevin Murphy,
Van Dang, Wilko Horn, Camillo Lugaresi, Shaohua Sun,
and Wei Zhang.
Knowledge-based trust: estimating the
trustworthiness of web sources, PVLDB 8(9), 2015
- Raul Castro Fernandez, Essam Mansour, Abdulhakim
Qahtan, Ahmed Elmagarmid, Ihab F. Ilyas, Samuel
Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang,
Seeping
Semantics: Linking Datasets using Word Embeddings
for Data Discovery, ICDE 2018
5. Data Cleaning Systems
- Helena Galhardas, Daniela Florescu, Dennis Shasha,
Eric Simon,
AJAX: An Extensible Data Cleaning Tool, SIGMOD
2000
- Vijayshankar Raman, Joseph M. Hellerstein,
Potter's Wheel: An Interactive Data Cleaning System,
VLDB 2001
- Amr Ebaid, Ahmed K. Elmagarmid, Ihab F. Ilyas,
Mourad Ouzzani, Jorge-Arnulfo Quiané-Ruiz, Nan Tang,
Si Yin, NADEEF: A Generalized Data Cleaning
System, SIGMOD 2013
- Floris Geerts, Giansalvatore Mecca, Paolo Papotti,
Donatello Santoro, The LLUNATIC Data-Cleaning
Framework, PVLDB 6(9), 2013
- Michael Stonebraker, Daniel Bruckner, Ihab F.
Ilyas, George Beskales, Mitch Cherniack, Stanley B.
Zdonik, Alexander Pagan, Shan Xu, Data Curation at Scale: The Data
Tamer System, CIDR 2013
- Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani,
Paolo Papotti, Nan Tang, Yin Ye, KATARA:
A Data Cleaning System Powered by Knowledge Bases
and Crowdsourcing, SIGMOD 2015
- Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel
Madden, Mourad Ouzzani, Paolo Papotti, Jorge
Quiané-Ruiz, Nan Tang, Si Yin, BigDansing:
A System for Big Data Cleansing, SIGMOD 2015