PhD Defence • Information Retrieval | Natural Language Processing • Pretrained Transformers for Efficient and Robust Information Retrieval

Tuesday, August 6, 2024 9:30 am - 12:30 pm EDT (GMT -04:00)

Please note: This PhD defence will take place online.

Minghan Li, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Jimmy Lin

Pretrained transformers have significantly advanced the field of information retrieval (IR). Through unsupervised pretraining followed by task-aware finetuning, the learned representations of pretrained transformers effectively capture the high-level, complex semantics of text inputs. This new retrieval paradigm addresses the limitations of traditional bag-of-words systems such as BM25 by bridging the lexical mismatch between queries and documents, marking milestones on several benchmark retrieval tasks.
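
To make the contrast with bag-of-words systems concrete, the following is a minimal sketch of the bi-encoder retrieval paradigm, assuming a generic off-the-shelf sentence encoder; the model name, corpus, and query are illustrative placeholders rather than the systems studied in the thesis.

from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

corpus = [
    "BM25 ranks documents by exact term overlap with the query.",
    "Dense retrievers embed queries and documents into a shared vector space.",
]
query = "How do neural retrievers handle vocabulary mismatch?"

# Documents are encoded once offline; queries are encoded at search time.
doc_vecs = encoder.encode(corpus, normalize_embeddings=True)
query_vec = encoder.encode([query], normalize_embeddings=True)[0]

# Relevance is an inner product between learned representations, so related
# texts can match even when they share no terms (unlike bag-of-words BM25).
scores = doc_vecs @ query_vec
for score, doc in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:.3f}  {doc}")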

Consequently, pretrained transformer models are widely applied at different stages of a modern IR system, which typically consists of a first-stage retriever, a second-stage reranker, and an optional retrieval-augmented language model. Despite significantly improving first-stage retrieval recall and second-stage reranking precision, such neural IR systems and their vector databases are plagued by redundancy, which inflates search latency and storage space. The second-stage reranker, based on cross-encoder pretrained transformers, can also incur high computational costs when reranking a large candidate set of retrieved documents (e.g., reranking the top-1000 documents).

In addition, the learned representations are lossy compressions of the original inputs and are tied to a specific data distribution, making them vulnerable to out-of-distribution queries and low-level nuances. These computational-overhead and robustness issues can further affect downstream retrieval-augmented language models, making it difficult to extract useful information efficiently from scattered retrieved content and encouraging hallucination when adversarial queries lead to the retrieval of biased content.

This thesis summarizes our attempts to address the efficiency and robustness issues in most components of a modern IR system based on pretrained transformers: first-stage retrieval, including dense, sparse, hybrid, and multi-vector retrievers; second-stage cross-encoder reranking; and the hallucination and attribution problems in downstream retrieval-augmented large language models (LLMs).

Specifically, for first-stage dense retrieval models, we exploit index compression to improve search speed and reduce storage space through dimension reduction and product quantization. We detail our efforts to enhance the robustness of dense retrieval models through techniques such as model ensembling and integration with sparse retrieval methods. To explore different retrieval model structures, we further extend these ideas to multi-vector retrieval systems, where we use dynamic lexical routing and token pruning to jointly optimize efficiency and effectiveness while finetuning the pretrained transformers on ranking tasks. We also investigate incorporating sparse retrieval into a multi-vector system and propose sparsified late interactions for efficient multi-vector retrieval, making it directly compatible with traditional inverted indexes and improving latency without sacrificing robustness.
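
As one concrete flavour of the index-compression idea, the sketch below combines PCA-based dimension reduction with product quantization using FAISS. The dimensions, quantizer settings, and random vectors are illustrative assumptions, not the configurations evaluated in the thesis.

import numpy as np
import faiss

d_in, d_out = 768, 256                        # original and reduced embedding dims (illustrative)
doc_vecs = np.random.rand(10_000, d_in).astype("float32")  # stand-in for encoded documents

pca = faiss.PCAMatrix(d_in, d_out)            # dimension reduction
pq = faiss.IndexPQ(d_out, 32, 8)              # product quantization: 32 sub-vectors, 8 bits each
index = faiss.IndexPreTransform(pca, pq)

index.train(doc_vecs)                         # learn the PCA projection and PQ codebooks
index.add(doc_vecs)                           # only compressed codes are stored

query_vec = np.random.rand(1, d_in).astype("float32")
distances, doc_ids = index.search(query_vec, 10)  # approximate top-10 neighbours
print(doc_ids)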

In the rest of the thesis, we continue to explore efficiency and robustness issues in second-stage cross-encoder rerankers, which focus more on ranking precision. We introduce a simple strategy for cross-encoder rerankers that adds late interaction at the last layer to better handle out-of-distribution, long-sequence data at minimal latency cost. Another line of work examines the use of LLMs for query expansion with high-precision cross-encoder rerankers in conventional ranking tasks, where we find that traditional query expansion methods weaken the robustness of strong rerankers while our method improves ranking precision over the base model on out-of-domain data. After integrating first-stage retrieval with second-stage reranking, we further propose a candidate-set pruning method with high-confidence error control to reliably speed up the reranking process, allowing users to specify precision goals while trading off efficiency and robustness.
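
The overall shape of candidate-set pruning before cross-encoder reranking can be pictured as follows. The fixed score cutoff stands in for the thesis's calibrated, high-confidence error-control procedure, and the model, query, and scores are placeholders.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "effects of caffeine on sleep"
# (first_stage_score, passage) pairs from the retriever, highest score first.
candidates = [
    (12.4, "Caffeine delays sleep onset and reduces deep sleep duration."),
    (9.1, "Global coffee production statistics by country."),
    (3.2, "A step-by-step recipe for cold brew coffee."),
]

# Prune: only pass candidates above a cutoff to the expensive cross-encoder,
# trading a small, controlled loss in recall for a much cheaper reranking pass.
cutoff = 5.0
pruned = [passage for score, passage in candidates if score >= cutoff]

rerank_scores = reranker.predict([(query, passage) for passage in pruned])
for score, passage in sorted(zip(rerank_scores, pruned), reverse=True):
    print(f"{score:.3f}  {passage}")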

Finally, we use these optimized two-stage retrieval systems to address the hallucination and attribution problems in downstream retrieval-augmented LLMs, further boosting overall efficiency and robustness. We introduce a hybrid generation model that directly incorporates segments retrieved from the corpus into LLM outputs to reduce hallucination and ensure precise attribution in instruction-following tasks. Our final system capitalizes on our efficient and robust solutions for neural retrieval, delivering precise search results swiftly. It also accelerates content generation by processing multiple tokens per time step, with a post-hoc revision mechanism that balances attribution, robustness, and latency.
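
At a very high level, hybrid generation can be pictured as splicing verbatim retrieved segments into the model's output with explicit source markers. The toy function below only illustrates that composition; the copy-versus-generate decision, multi-token decoding, and post-hoc revision of the actual system are omitted, and all names are hypothetical.

retrieved = {
    "doc42": "Pretrained transformers have significantly advanced information retrieval.",
}

def hybrid_answer(generated_prefix: str, doc_id: str, generated_suffix: str) -> str:
    """Splice a verbatim retrieved segment into the output with a source marker."""
    segment = retrieved[doc_id]
    return f'{generated_prefix} "{segment}" [{doc_id}] {generated_suffix}'

print(hybrid_answer(
    "As the retrieved passage states,",
    "doc42",
    "so the generated claim can be traced back to its source.",
))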

Overall, this thesis contributes to enhancing the efficiency and robustness of key components in a contemporary IR system and their integration with downstream LLMs in knowledge-intensive generation tasks.

Attend on Zoom.