Master’s Thesis Presentation • Data Systems — dstlr: Scalable Knowledge Graph Construction from Text Collections | Cheriton School of Computer Science

Friday, March 27, 2020 11:00 am EDT (GMT -04:00)

Please note: This master’s thesis presentation will be given online.

Ryan Clancy, Master’s candidate
David R. Cheriton School of Computer Science

In recent years, the amount of data being generated for consumption by enterprises has increased exponentially. Enterprises typically work with structured data, but the data being generated is often semi-structured or unstructured. In particular, there exists a wealth of unstructured text data (customer reviews, social media posts, news articles, etc.) containing information that could provide value to an organization. As data from different sources often reside in silos, a number of questions arise: How do we integrate the structured and unstructured data? How can we curate and refine the data? Can we do this at scale?

In this thesis, I present dstlr — a platform for scalable knowledge graph construction from text collections. I show how knowledge triples extracted from a collection of unstructured text documents can be used to form a knowledge graph, enabling integration of structured and unstructured data. Further, I show that linking triples to an existing knowledge graph enables rule-based data curation using the additional external information. I demonstrate this on a large collection of news articles, highlighting the horizontal scale-out of the system.
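
As a rough illustration of the idea only, and not dstlr's actual implementation, the Python sketch below shows how (subject, relation, object) triples might be indexed as a graph and then checked against an existing knowledge graph with a simple rule. All entity names, relation names, values, and the helper functions `build_graph` and `curate` are hypothetical, and the "existing knowledge graph" is assumed to be a plain dictionary.

```python
from collections import defaultdict


def build_graph(triples):
    """Index triples as subject -> list of (relation, object) edges."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return dict(graph)


def curate(triples, reference):
    """Label each triple by comparing it with a reference knowledge graph.

    `reference` maps (subject, relation) -> expected object (an assumption
    made for this sketch; a real knowledge graph would be queried instead).
    """
    report = []
    for subject, relation, obj in triples:
        expected = reference.get((subject, relation))
        if expected is None:
            status = "unverified"    # no external fact to compare against
        elif expected == obj:
            status = "supported"     # text agrees with the external fact
        else:
            status = "contradicted"  # text disagrees with the external fact
        report.append(((subject, relation, obj), status))
    return report


# Hypothetical triples, as if extracted from unstructured text.
extracted = [
    ("Acme Corp", "HEADQUARTERED_IN", "Springfield"),
    ("Acme Corp", "FOUNDED_IN", "1989"),
    ("Acme Corp", "CEO", "Jane Doe"),
]

# Hypothetical facts standing in for an existing knowledge graph.
reference = {
    ("Acme Corp", "HEADQUARTERED_IN"): "Springfield",
    ("Acme Corp", "FOUNDED_IN"): "1998",
}

graph = build_graph(extracted)
report = curate(extracted, reference)

for triple, status in report:
    print(triple, status)
```

In this sketch the curation rule is a plain equality check; the point is only that linking extracted triples to external facts gives a rule something to compare against.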

To join this Zoom meeting, please go to https://zoom.us/j/5199694213

Meeting ID: 519 969 4213