Data Systems Seminar Series • Efficient Query Processing and Learning on Dirty Data

Wednesday, March 18, 2026 10:30 am - 11:30 am EDT (GMT -04:00)

Please note: This seminar will take place in DC 1304.

Boris Glavic, Associate Professor
Department of Computer Science, University of Illinois at Chicago

Data quality issues such as missing values, constraint violations, and outliers are prevalent in most real-world datasets. While the database community has developed a rich toolbox for addressing such data errors, the dominant practice is still to select a single “best-guess” repair that is then treated as gospel. However, the ground truth clean version of a dirty dataset is often unavailable, expensive to collect, or fundamentally non-identifiable from available observations. Seemingly reasonable cleaning choices embed hard-to-validate assumptions that, if violated, can lead to erroneous and misleading analysis outcomes. Because the ground truth is typically not recoverable, trusting any result computed over dirty data requires reasoning about all possible repairs and deriving sound bounds on the possible outcomes, either to certify the result's robustness or to demonstrate that it is fundamentally too uncertain to be trusted. Unfortunately, this is computationally hard, even for relatively simple classes of computations and limited types of data errors.

In this talk, I will provide an overview of our work on lightweight models for uncertain data that compactly represent the set of all feasible repairs of a dirty dataset across a wide range of data quality issues. Our work is the first to provide efficient support for evaluating complex relational queries as well as machine learning training and inference over uncertain data.
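
As a toy illustration of what "sound bounds over all possible repairs" can mean, the sketch below bounds a SUM over a column with missing values. It is not the speaker's method: the function name `sum_bounds` and the assumption that every missing value lies in a known range [lo, hi] are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence, Tuple


def sum_bounds(
    values: Sequence[Optional[float]], lo: float, hi: float
) -> Tuple[float, float]:
    """Bound sum(values) over every repair of the missing entries.

    A repair replaces each missing entry (None) with some number in
    [lo, hi]. Assumption (hypothetical): that range is known in advance.
    """
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    # Sum of the entries that are actually present.
    known = sum(v for v in values if v is not None)
    # Each missing entry contributes at least lo and at most hi.
    missing = sum(1 for v in values if v is None)
    return known + missing * lo, known + missing * hi


readings = [10.0, None, 12.0, None]
lower, upper = sum_bounds(readings, lo=0.0, hi=20.0)
print(lower, upper)  # 22.0 62.0
```

Every repair consistent with the assumed range yields a sum inside the returned interval, so a conclusion that holds for the whole interval does not depend on any single cleaning choice. SUM is an easy case; as the abstract notes, the problem is computationally hard for more complex computations.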

Bio: Boris Glavic is an Associate Professor in the Department of Computer Science at the University of Illinois at Chicago, leading UIC’s DBGroup. His research spans several areas of database systems and data science, including data provenance, query execution and optimization, uncertain data, systems for ML, and data integration and cleaning.