PhD Seminar • Data Systems — A Semi-supervised Framework of Clustering Selection for De-duplication

Wednesday, December 9, 2020 12:00 pm - 12:00 pm EST (GMT -05:00)

Please note: This PhD seminar will be given online.

Hemant Saxena, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Ihab Ilyas

We view data de-duplication as a clustering problem. A recent work introduced a framework called restricted correlation clustering (RCC) to model de-duplication problems. Given a set $X$, an unknown target clustering $C^*$ of $X$ and a class $F$ of clusterings of $X$, the goal is to find a clustering $C$ from the set $F$ which minimizes the correlation loss. The clustering algorithm is allowed to interact with a domain expert by asking whether a pair of records correspond to the same entity or not. Main drawback of the algorithm in RCC is that the pre-processing step has a time complexity of $\Theta(|X|^2)$ (where $X$ is the input set).

In this work, we make the following contributions. We develop a sampling procedure (based on locality sensitive hashing) which requires a linear pre-processing time $O(|X|)$. We prove that our sampling procedure can estimate the correlation loss of all clusterings in $F$ using only a small number of labelled examples. In fact, the number of labelled examples is independent of $|X|$ and depends only on the complexity of the class $F$. Further we show that to sample one pair, with high probability our procedure makes a constant number of queries to the domain expert. We then perform an extensive empirical evaluation of our approach which shows the efficiency of our method.


To join this PhD seminar on Zoom, please go to https://us02web.zoom.us/j/88527315254?pwd=LzNtOGxEenA0NE5GZ29wUDJNaHJxdz09.

Please note: This link was updated at 9:00 a.m. on Wednesday, December 9.