Please note: This PhD defence will take place in DC 2314.
Sheng-Chieh (Jack) Lin, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Jimmy Lin
First-stage retrieval plays a pivotal role in modern search engines, collecting candidates from a large corpus for various downstream tasks such as question answering, fact-checking, and retrieval-augmented generation. Retrieval with heuristic lexical representations, such as BM25, has long been used as the first-stage retriever. Recently, driven by the development of pre-trained transformers and approximate nearest neighbor search, transformer-based dense retrieval (DR) models have become a strong alternative to BM25, since they address the term-mismatch problem of lexical representations through semantic matching. However, unlike BM25, transformer-based DR models require substantial resources for training (e.g., human relevance labels and GPUs). Furthermore, even with sufficient training data, such DR models still fall short of BM25 in robustness across retrieval domains, tasks, and languages, which limits their usefulness. Alternatively, dense–sparse fusion and multi-vector bi-encoder retrieval models capture both semantic and lexical features from text and have demonstrated superior retrieval robustness. Nevertheless, these designs incur higher query latency and require more complex indexing and retrieval pipelines, making them less favorable as first-stage retrieval systems. To deploy a robust and efficient first-stage retrieval system, this thesis contributes to information retrieval in two directions. First, we propose efficient training methods to improve the effectiveness of transformer-based DR models. Second, we propose a dense representation framework under which dense and sparse (semantic and lexical matching) indexing and retrieval can be simplified and performed efficiently.
We begin by improving the efficiency of a widely adopted training technique for DR models: knowledge distillation. Instead of distilling knowledge from more powerful cross-encoder rankers, we propose to leverage a multi-vector bi-encoder model (i.e., ColBERT) to teach a DR model, coined TCT-ColBERT. Specifically, we transfer knowledge from the bi-encoder teacher to the student by distilling ColBERT's fine-grained scoring function into a simple dot product. The advantage of the bi-encoder teacher–student setup is that we can efficiently add in-batch negatives during knowledge distillation, enabling richer interactions between teacher and student models. We demonstrate that our approach yields better training efficiency and effectiveness compared to prior studies.
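To make the bi-encoder distillation concrete, the sketch below (illustrative only, with assumed tensor shapes and an assumed temperature value, not the thesis implementation) shows how a teacher's ColBERT-style MaxSim scores over in-batch passages can be distilled into a student's single-vector dot products via a KL-divergence loss.

import torch
import torch.nn.functional as F

def maxsim_scores(q_tok, p_tok):
    # q_tok: [B, Lq, D] teacher query token embeddings; p_tok: [B, Lp, D] passage token embeddings.
    # Score every query against every in-batch passage: sum of per-query-token max similarities.
    sim = torch.einsum("qid,pjd->qpij", q_tok, p_tok)   # [B, B, Lq, Lp]
    return sim.max(dim=-1).values.sum(dim=-1)           # [B, B]

def tct_distill_loss(q_tok, p_tok, q_vec, p_vec, temperature=0.25):
    # q_vec, p_vec: [B, D] student single-vector representations.
    teacher = F.softmax(maxsim_scores(q_tok, p_tok) / temperature, dim=-1)
    student = F.log_softmax(q_vec @ p_vec.T, dim=-1)     # dot products over all in-batch passages
    return F.kl_div(student, teacher, reduction="batchmean")

Because both teacher and student are bi-encoders, the full B-by-B score matrix over in-batch passages is cheap to compute, which is what makes in-batch negatives practical during distillation.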
Subsequently, we propose a new DR model design that captures both semantic and lexical features from a pre-trained transformer for robust dense retrieval. Unlike existing DR models, which extract semantic textual representations from either the [CLS] token or average pooling, our approach fully exploits the knowledge in pre-trained transformers by combining semantic features from [CLS] with lexical features from the masked language modeling head. We demonstrate the advantages of our newly designed DR model, Aggretriever, over existing DR models in terms of training efficiency and robustness: Aggretriever achieves competitive retrieval effectiveness in both supervised and zero-shot evaluations while requiring only a single V100 (32GB) GPU and no sophisticated, expensive training strategies such as knowledge distillation and hard negative mining. In addition, we extend Aggretriever to a multilingual DR model, coined mAggretriever, and demonstrate its superior zero-shot transferability across languages. On multilingual benchmarks, mAggretriever, fine-tuned only on English training data, shows superior retrieval effectiveness compared to existing multilingual DR models, which rely on computationally expensive pre-training on large-scale multilingual data crawled from the web.
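As an illustration of the idea, the sketch below (a simplified folding scheme with an assumed lexical dimension, not the exact Aggretriever pooling) combines the [CLS] hidden state with lexical features obtained by max-pooling the masked-language-modeling logits over input tokens and folding the vocabulary-sized vector into a fixed-width block.

import torch
import torch.nn.functional as F

def aggretriever_style_encode(cls_vec, mlm_logits, lexical_dim=640):
    # cls_vec:    [B, D]    hidden state of the [CLS] token (semantic features)
    # mlm_logits: [B, L, V] MLM logits for each input token over the vocabulary (lexical features)
    lexical = torch.relu(mlm_logits).max(dim=1).values               # [B, V] max-pool over tokens
    B, V = lexical.shape
    lexical = F.pad(lexical, (0, (-V) % lexical_dim))                # pad so V folds evenly
    lexical = lexical.view(B, lexical_dim, -1).max(dim=-1).values    # fold |V| slots into lexical_dim slots
    return torch.cat([cls_vec, lexical], dim=-1)                     # one dense vector for dot-product retrieval

The result is still a single dense vector per query or passage, so standard single-vector indexing and dot-product search apply unchanged.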
Despite the success of DR models, we observe that existing DR models still lag behind multi-vector DR models, and even BM25, in robustness across retrieval domains and tasks. Recent studies explore scaling DR training in terms of training data and model size; however, such scaling requires collecting billion-scale training data from the web and significantly increases training cost. To improve the efficiency of training a robust DR model, we conduct a systematic study of DR training through the lens of data augmentation and identify the key to fine-tuning a robust DR model: diverse query and label augmentation. As a result, we are the first to empirically demonstrate that a BERT-base-sized DR model (110M parameters), coined DRAGON, can compete with the state-of-the-art multi-vector bi-encoder model and with a DR model that has 40 times more parameters (4.8B) in both supervised and zero-shot evaluations. More importantly, rather than crawling additional billion-scale data from the web, our data augmentation uses only the 8.8M passages from the MS MARCO corpus.
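The sketch below (with hypothetical helper callables, not the released DRAGON recipe) illustrates the shape of this augmentation: multiple query generators provide diverse synthetic queries for each passage, and multiple retrieval "teachers" provide diverse relevance labels for those queries.

import random

def build_training_examples(corpus, query_generators, teachers, k=10):
    # query_generators: e.g., sentence cropping and neural query generation (hypothetical callables)
    # teachers: diverse relevance labelers, e.g., sparse, dense, and multi-vector retrievers
    examples = []
    for passage in corpus:
        for generate in query_generators:              # query augmentation: diverse synthetic queries
            query = generate(passage)
            for teacher in teachers:                   # label augmentation: diverse supervision sources
                ranked = teacher.retrieve(query, top_k=k)
                positives = ranked[:1]                 # treat the teacher's top hit as a positive
                negatives = random.sample(ranked[1:], min(3, len(ranked) - 1))
                examples.append((query, positives, negatives))
    return examples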
Finally, we discuss how to deploy a more effective hybrid retrieval model efficiently. To this end, we propose a dense representation framework for semantic and lexical matching. Any off-the-shelf dense (semantic) and sparse (lexical) features can be represented under this unified dense representation framework. The framework simplifies dense–sparse fusion with a single index, and supports GPU-accelerated retrieval and efficient approximate nearest neighbor search using existing libraries. In addition, the framework enables a newly designed bi-encoder hybrid retrieval model, (DeLADE+[CLS])DHR, which yields competitive retrieval effectiveness and reduces runtime cost compared to existing bi-encoder multi-vector retrievers.
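To show how single-index fusion can work in practice, the sketch below (illustrative, with a simplified lossy densification step and assumed dimensions, not the exact DHR construction) folds a vocabulary-sized sparse vector into a fixed-width dense block, concatenates it with an off-the-shelf dense vector, and searches both with one inner-product index from an existing library.

import numpy as np
import faiss

def densify(sparse_vec, out_dim=768):
    # sparse_vec: [V] lexical term weights; keep only the max weight in each contiguous vocabulary slice.
    padded = np.pad(sparse_vec, (0, (-sparse_vec.shape[0]) % out_dim))
    return padded.reshape(out_dim, -1).max(axis=1)

def document_vector(dense_vec, sparse_vec):
    return np.concatenate([dense_vec, densify(sparse_vec)]).astype("float32")

def query_vector(dense_vec, sparse_vec, alpha=0.5):
    # Weighting only the query side makes one inner product equal to
    # alpha * semantic_score + (1 - alpha) * lexical_score.
    return np.concatenate([alpha * dense_vec, (1 - alpha) * densify(sparse_vec)]).astype("float32")

rng = np.random.default_rng(0)
docs = [(rng.random(768), rng.random(30522)) for _ in range(4)]   # dummy (dense, sparse) pairs
index = faiss.IndexFlatIP(768 + 768)                              # a single index holds the fused vectors
index.add(np.stack([document_vector(d, s) for d, s in docs]))
scores, ids = index.search(query_vector(rng.random(768), rng.random(30522))[None, :], 3)

The densification here keeps only slice-level maxima, but it preserves the key property: dense and sparse signals live in one vector, so fusion needs no separate inverted index or post-hoc score interpolation.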
We close by summarizing our conclusions and examining future research directions. For example, we discuss promising directions for leveraging large language models (LLMs) to build more robust retrieval systems that support search planning and multimodal retrieval.