Master’s Thesis Presentation • Data Systems — Entity Matching and Disambiguation Across Multiple Knowledge Graphs | Cheriton School of Computer Science

Tuesday, May 21, 2019 9:30 am - 9:30 am EDT (GMT -04:00)

Michael Farag, Master’s candidate
David R. Cheriton School of Computer Science

Knowledge graphs are considered an important representation that lies between free text on one hand and fully-structured relational data on the other. Knowledge graphs are a backbone of many applications on the Web. With the rise of many large-scale open-domain knowledge graphs like Freebase, DBpedia, and Yago, various applications including document retrieval, question answering, and data integration have been relying on them.

In this thesis, we are primarily interested in knowledge graphs from the perspective of integrating disparate heterogeneous sources, with an eye towards applications such as document retrieval and question answering. Integrating different knowledge graphs is very important for enriching the knowledge shared among them. The core part of this integration process is matching entities across the knowledge graphs. The biggest challenge in entity matching is ambiguity. The obvious solution is to make use of the graph structure and entity neighborhoods for matching and disambiguating entities.

We formalize the entity matching problem and present the first large-scale dataset for this task, Ambiguous DBpedia-Wikidata, which is based on existing cross-ontology links between DBpedia and Wikidata and focuses on several hundred thousand ambiguous entities.

We propose an entity matching framework that is capable of disambiguating entities across different knowledge graphs. The framework consists of a fuzzy string matcher and a graph embedding-based matcher. Using a classification-based approach, we find that a simple multi-layer perceptron based on representations derived from RDF2VEC graph embeddings of entities in each knowledge graph is sufficient to achieve high accuracy, with only limited training data. The contribution of our work is both a large dataset for examining this problem and strong baselines on which future work can be based.
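
As a rough illustration of the two-stage idea above (string-based candidate generation followed by embedding-based disambiguation), a minimal sketch might look like the following. It is not the thesis implementation: the function names, the toy labels and vectors, and the 0.8 threshold are all hypothetical, and plain cosine similarity stands in for the trained multi-layer perceptron. It also assumes the toy vectors share one space, which real per-graph RDF2VEC embeddings do not, and which is one reason a learned classifier is needed in practice.

```python
from difflib import SequenceMatcher
from math import sqrt


def fuzzy_candidates(label, target_labels, threshold=0.8):
    """Stage 1: return ids of target entities whose label resembles `label`."""
    label = label.lower()
    return [
        entity_id
        for entity_id, target_label in target_labels.items()
        if SequenceMatcher(None, label, target_label.lower()).ratio() >= threshold
    ]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_entity(label, source_vec, target_labels, target_vecs, threshold=0.8):
    """Stage 2: among string-matched candidates, pick the closest embedding.

    Cosine similarity is a stand-in here for a trained classifier's score.
    """
    candidates = fuzzy_candidates(label, target_labels, threshold)
    if not candidates:
        return None
    return max(candidates, key=lambda e: cosine(source_vec, target_vecs[e]))


# Toy data (hypothetical): two target entities share the ambiguous label "Paris".
target_labels = {"T1": "Paris", "T2": "Paris", "T3": "London"}
target_vecs = {"T1": [0.9, 0.1], "T2": [0.1, 0.9], "T3": [0.8, 0.2]}

best = match_entity("Paris", [1.0, 0.0], target_labels, target_vecs)
print(best)
```

The string matcher alone cannot separate the two "Paris" candidates; only the second stage, which uses information derived from the graph, resolves the ambiguity.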

We also present SimpleDBpediaQA, a new benchmark dataset for simple question answering over knowledge graphs that was created by mapping SimpleQuestions entities and predicates from Freebase to DBpedia. We show how entity matching using manual annotations can be used for migrating datasets across knowledge graphs.

Although this mapping is conceptually straightforward, there are a number of nuances that make the task non-trivial, owing to the different conceptual organizations of the two knowledge graphs.

Finally, if manual annotations are scarce, we show how our entity matching framework can be used to generate free annotations to train our model and then use it for disambiguation. In that spirit, we introduce SimpleQuestions++, a new question answering benchmark that has all questions linked to Freebase, DBpedia, and Wikidata.

Location Information

Location Address: DC - William G. Davis Computer Research Centre
200 University Avenue West
2310
Waterloo, ON, CA N2L 3G1
