# PhD Seminar • Artificial Intelligence (Machine Learning) — Semi-supervised Clustering for De-duplication

Wednesday, March 27, 2019 — 4:30 PM EDT

Shrinu Kushagra, PhD candidate
David R. Cheriton School of Computer Science

Data de-duplication is the task of detecting multiple records that correspond to the same real-world entity in a database. In this work, we view de-duplication as a clustering problem. We introduce a framework which we call promise correlation clustering. Given a complete graph $G$ with the edges labeled $0$ and $1$, the goal is to find a clustering that minimizes the number of $0$ edges within a cluster plus the number of $1$ edges across different clusters (or correlation loss). The optimal clustering can also be viewed as a complete graph $G^*$ with edges corresponding to points in the same cluster being labeled $0$ and other edges being labeled $1$.

Under the promise that the edge difference between $G$ and $G^*$ is "small," we prove that finding the optimal clustering (or $G^*$) is still NP-Hard. Furthermore, the problem is NP-Hard (assuming the ETH-hypothesis) even when given access to an oracle (when the learner is allowed to interact with a domain expert, asking whether two given instances belong to the same cluster or not).

Given these negative results, we consider a restricted version of correlation clustering. As before, the goal is to find a clustering that minimizes the correlation loss. However, we restrict ourselves to a given class $\mc F$ of clusterings. We offer a semi-supervised algorithmic approach to solve the restricted variant with success guarantees.

Location
DC - William G. Davis Computer Research Centre
2310
200 University Avenue West

Waterloo, ON N2L 3G1

### January 2022

S M T W T F S
26
27
28
29
30
31
1
2
3
5
7
8
9
11
12
15
16
22
23
25
26
28
29
30
1
2
3
4
5
1. 2022 (19)
1. February (1)
2. January (18)
2. 2021 (210)
1. December (21)
2. November (13)
3. October (12)
4. September (21)
5. August (20)
6. July (17)
7. June (11)
8. May (16)
9. April (27)
10. March (20)
11. February (13)
12. January (19)
3. 2020 (217)
4. 2019 (255)
5. 2018 (217)
6. 2017 (36)
7. 2016 (21)
8. 2015 (36)
9. 2014 (33)
10. 2013 (23)
11. 2012 (4)
12. 2011 (1)
13. 2010 (1)
14. 2009 (1)
15. 2008 (1)