PhD Defence • Natural Language Processing | Information Retrieval • Design of Neural Models for Domain-Specific Retrieval-Based Information Access Systems

Friday, August 11, 2023 9:00 am - 12:00 pm EDT (GMT -04:00)

Please note: This PhD defence will take place online.

Yuqing Xie, PhD candidate
David R. Cheriton School of Computer Science

Supervisors: Professors Ming Li, Jimmy Lin

The ever-increasing volume of web-based documents makes it challenging to efficiently access specialized knowledge from domain-specific sources, a task that requires a profound understanding of the domain and substantial comprehension effort. Although natural language technologies, such as information retrieval and machine reading comprehension systems, offer rapid and accurate information retrieval, their performance in specific domains is hindered by training on general-domain datasets. Creating domain-specific training datasets, while effective, is time-consuming, expensive, and heavily reliant on domain experts. This thesis presents a comprehensive exploration of efficient technologies to address the challenge of information access in specific domains, with a focus on retrieval-based systems encompassing question answering and ranking.

Initially, the structure of a retrieval-based information access system is demonstrated through a typical open-domain question answering task. An introduction to the retrieval-based system is provided, outlining its two major components: retrieval and reader models. Furthermore, a data augmentation method is introduced to adapt the reader model, originally trained on closed-domain datasets, to effectively answer questions in the retrieval-based open-domain setting. The trade-offs associated with the retrieval model are discussed, and the best practical trade-off frontier is presented.

Subsequently, a range of methods enabling system adaptation to specific domains is discussed. Transfer learning techniques, including further pre-training, question generation as data augmentation, and data clustering and selection, are presented. A comprehensive evaluation of these horizontal methods is conducted to assess their effectiveness in diverse specific domains. Moreover, the exploration extends to retrieval-based question answering systems beyond textual corpora. Specifically, the search system in the math domain, characterized by the unique role of formulas and distinct features compared to textual searches, is investigated.

Additionally, e-commerce database search is explored, wherein natural language queries are combined with user preference data to facilitate the retrieval of relevant products. To address the challenges of noisy labels and cold-start problems in the retrieval-based e-commerce ranking system, model training is enhanced through cascaded training and adversarial sample weighting.

Finally, the research findings are summarized, and future research directions are discussed.