PhD Seminar • Data Systems — Predictable and Consistent Information Extraction

Wednesday, May 29, 2019 12:15 PM EDT

Besat Kassaie, PhD candidate
David R. Cheriton School of Computer Science

Information extraction programs are applied to documents to isolate structured versions of some content, that is, to create corresponding records in a relational table. In this work we introduce a new research problem related to the information extraction process. We consider an extracted table as a view generated from the input documents. More specifically, given an extraction algorithm and a document, the algorithm produces a table T as if T were a view of the input document. In line with this perspective, we are interested in updating the input document so that extraction produces a modified table T'.
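As a rough illustration of this view-update perspective (not the method presented in the talk), the sketch below uses a hypothetical regular-expression extractor in Python. The pattern, the function names extract and update_document, and the sample document are all assumptions made for illustration. It assumes the modified table has the same number of rows as the original and that every new value still matches the pattern.

```python
import re

# Hypothetical extractor rule: one row per "Patient <name>, age <n>" mention.
PATTERN = re.compile(r"Patient (\w+), age (\d+)")

def extract(document: str) -> list[tuple[str, int]]:
    """Return the table T as a list of (name, age) rows."""
    return [(name, int(age)) for name, age in PATTERN.findall(document)]

def update_document(document: str, new_rows: list[tuple[str, int]]) -> str:
    """Rewrite each matched span, in order, so that extraction yields new_rows.

    Assumption: new_rows has exactly one row per match in the document.
    """
    if len(new_rows) != len(PATTERN.findall(document)):
        raise ValueError("row count must match the number of extracted spans")
    rows = iter(new_rows)

    def replace(match: re.Match) -> str:
        name, age = next(rows)
        return f"Patient {name}, age {age}"

    return PATTERN.sub(replace, document)

doc = "Patient Alice, age 34 was admitted. Patient Bob, age 51 was discharged."
table = extract(doc)                                  # T
modified = [(name, age + 1) for name, age in table]   # T', a stand-in transformation
new_doc = update_document(doc, modified)

# Consistency: extracting from the updated document reproduces T'.
assert extract(new_doc) == modified
```

The final assertion is the consistency condition in miniature: running the same extractor over the updated document yields exactly the modified table. For a toy pattern like this one the condition is easy to satisfy; for realistic rule-based extractors it may fail, for example when a rewritten span starts or stops matching another rule.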

We were motivated to study this problem by the task of applying privacy transformations to documents. Consider the problem of maintaining privacy for personal information contained in a collection of medical documents when publishing research results derived from those records. It has been shown that simply avoiding the publication of identifiers does not protect the privacy of individuals. A common solution to this problem has been to apply differential privacy.

Our approach is to apply differential privacy to the table(s) obtained from a document collection through information extraction. As a result, the modified extracted tables can be published and analyzed by untrusted parties without risking the loss of privacy for individuals. We wish to present to those researchers a set of documents that would have produced the modified table had the same information extraction procedure been applied to them.

We characterize extractors for which we are able to maintain consistency among the input documents, the extractor, and the modified tables. To this end, we introduce three properties for extractors. If a given extractor satisfies the proposed properties, we can guarantee that consistency will be maintained. We propose a property verification process that uses static analysis for a substantial subset of JAPE, a well-established rule-based extraction language, and illustrate it through several examples.

Location

DC - William G. Davis Computer Research Centre

1304
200 University Avenue West
Waterloo, ON N2L 3G1
Canada