PhD Seminar • Artificial Intelligence (Machine Learning) — Semi-supervised Clustering for De-duplication

Wednesday, March 27, 2019 4:30 pm - 4:30 pm EDT (GMT -04:00)

Shrinu Kushagra, PhD candidate
David R. Cheriton School of Computer Science

Data de-duplication is the task of detecting multiple records that correspond to the same real-world entity in a database. In this work, we view de-duplication as a clustering problem. We introduce a framework which we call promise correlation clustering. Given a complete graph $G$ with the edges labeled $0$ and $1$, the goal is to find a clustering that minimizes the number of $0$ edges within a cluster plus the number of $1$ edges across different clusters (or correlation loss). The optimal clustering can also be viewed as a complete graph $G^*$ with edges corresponding to points in the same cluster being labeled $0$ and other edges being labeled $1$.

Under the promise that the edge difference between $G$ and $G^*$ is "small," we prove that finding the optimal clustering (or $G^*$) is still NP-Hard. Furthermore, the problem is NP-Hard (assuming the ETH-hypothesis) even when given access to an oracle (when the learner is allowed to interact with a domain expert, asking whether two given instances belong to the same cluster or not).

Given these negative results, we consider a restricted version of correlation clustering. As before, the goal is to find a clustering that minimizes the correlation loss. However, we restrict ourselves to a given class $\mc F$ of clusterings. We offer a semi-supervised algorithmic approach to solve the restricted variant with success guarantees.