Please note: This PhD seminar will take place in DC 3301.
Nandan Thakur, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Jimmy Lin
Retrieval systems face two major bottlenecks limiting progress: in-domain overfitting due to unrealistic benchmarks, and the scarcity of high-quality training data. To address the overfitting challenge, we introduced two benchmarks: BEIR, focused on “zero-shot” evaluation of out-of-domain accuracy, and MIRACL, designed to measure out-of-language accuracy. Retrieval-Augmented Generation (RAG) has since emerged as a way to extend the limits of parametric knowledge in LLMs by incorporating external, up-to-date information through a retrieval stage before the LLM generates its response. Yet real-world demands on RAG applications have shifted, and evaluation metrics and benchmarks must evolve to better capture retrieval diversity and recall.
I will present FreshStack, a benchmark of complex technical questions asked by users in niche programming domains, designed to minimize contamination from LLM pretraining corpora. To address the data scarcity challenge, I will discuss approaches such as GPL and SWIM-IR for generating large, high-quality synthetic training datasets. I will also examine training data quality with RLHN, observing that more data is not always better and that relabeling false hard negatives curates the training data and improves out-of-distribution retrieval accuracy. Finally, I will share findings from the TREC 2024 RAG track, which investigates nugget- and support-based evaluation, comparing human and LLM judges on whether answers contain the necessary facts and whether cited documents truly support those answers, and extending the evaluation framework across languages to measure LLM hallucinations. I will conclude with a vision for constructing complex benchmarks that support agentic retrieval systems capable of decomposing and solving multi-step information-seeking tasks.