Shrinu
Kushagra,
PhD
candidate
David
R.
Cheriton
School
of
Computer
Science
Data de-duplication is the task of detecting multiple records that correspond to the same real-world entity in a database. In this work, we view de-duplication as a clustering problem. We introduce a framework which we call promise correlation clustering. Given a complete graph $G$ with the edges labeled $0$ and $1$, the goal is to find a clustering that minimizes the number of $0$ edges within a cluster plus the number of $1$ edges across different clusters (or correlation loss). The optimal clustering can also be viewed as a complete graph $G^*$ with edges corresponding to points in the same cluster being labeled $0$ and other edges being labeled $1$.
Under the promise that the edge difference between $G$ and $G^*$ is "small," we prove that finding the optimal clustering (or $G^*$) is still NP-Hard. Furthermore, the problem is NP-Hard (assuming the ETH-hypothesis) even when given access to an oracle (when the learner is allowed to interact with a domain expert, asking whether two given instances belong to the same cluster or not).
Given these negative results, we consider a restricted version of correlation clustering. As before, the goal is to find a clustering that minimizes the correlation loss. However, we restrict ourselves to a given class $\mc F$ of clusterings. We offer a semi-supervised algorithmic approach to solve the restricted variant with success guarantees.