PhD Seminar • Artificial Intelligence — Computer Vision on Web Pages: A Study of Man-Made Images

Thursday, June 21, 2018 9:30 am - 9:30 am EDT (GMT -04:00)

Michael Cormier, PhD candidate
David R. Cheriton School of Computer Science

This thesis is focused on the development of computer vision techniques for parsing web pages using an image of the rendered page as evidence, and on understanding this under-explored class of images from the perspective of computer vision. This project is divided into two tracks — applied and theoretical — which complement each other. Our practical motivation is the application of improved web page parsing to assistive technology, such as screenreaders for visually impaired users or the ability to declutter the presentation of a web page for those with cognitive deficit. From a more theoretical standpoint, images of rendered web pages have interesting properties from a computer vision perspective; in particular, low-level assumptions can be made in this domain, but the most important cues are often subtle and can be highly non-local. The parsing system developed in this thesis is a principled Bayesian segmentation-classification pipeline, using innovative techniques to produce valuable results in this challenging domain. The thesis includes both implementation and evaluation solutions.

Segmentation of a web page is the problem of dividing it into semantically significant, visually coherent regions. We use a hierarchical segmentation method based on the detection of semantically significant lines (possibly broken lines) which divide regions. The Bayesian design allows sophisticated probability models to be applied to the segmentation process, and our method produces segmentation trees that achieve good performance on a variety of measures.

Classification, for our purposes, is identifying the semantic role of regions in the segmentation tree of a page. We achieve promising results with a Bayesian classification algorithm based on the novel use of a hidden Markov tree model, in which the structure of the model is adapted to reflect the structure of the segmentation tree. This allows the algorithm to make effective use of the context in which regions appear as well as the features of each individual region.

The methods used to evaluate our page parsing system include qualitative and quantitative evaluation of algorithm performance (using manually-prepared ground truth data) as well as a user study of an assistive interface based on our page segmentation algorithm. We also performed a separate user study to investigate users' perceptions of web page organization and to generate ground truth segmentations, leading to important insights about consistency.

Taken as a whole, this thesis presents innovative work in computer vision which contributes both to addressing the problem of web accessibility and to the understanding of semantic cues in images.