Please note: This PhD defence will take place online.
Negar Arabzadehghahyazi, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Charles Clarke
The rapid development of neural retrieval models and generative information-seeking systems has outpaced traditional evaluation methods, revealing critical gaps, especially when relevance labels are sparse. Current frameworks often fail to compare retrieval-based and generation-based systems fairly. Large language models (LLMs) further challenge conventional evaluation while offering new possibilities for automation.
This thesis first shows that sparse labeling introduces bias, underestimating strong models that retrieve relevant but unjudged documents. To address this, we propose a new evaluation method using Fréchet Distance, which improves robustness and enables comparison between retrieval and generative systems. We then explore the use of LLMs for evaluation, focusing on automated relevance judgments. We compare LLM-based methods, expose the lack of standardization, and propose a framework to assess these approaches based on alignment with human labels and impact on system rankings. We also highlight how prompt variations affect LLM evaluation consistency. Finally, we extend our analysis to evaluating generated content across tasks, including retrieval-assisted methods for text generation, IR-inspired evaluation for text-to-image models, and a broader framework for assessing LLM-powered applications. Together, these contributions advance evaluation methods for modern information access systems.
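The abstract does not spell out the Fréchet Distance method itself, but as a rough illustration of the general idea, the sketch below compares a system's output to a set of judged-relevant documents by measuring the Fréchet distance between their embedding distributions, each approximated as a multivariate Gaussian. The embedding step, the Gaussian approximation, and all names (frechet_distance, embed_texts, the variable names in the usage comments) are assumptions made for illustration, not the thesis implementation.

    # Illustrative sketch only: Fréchet distance between two sets of text
    # embeddings, each modelled as a Gaussian (mean + covariance).
    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
        """Fréchet distance between embedding sets x and y (rows = items)."""
        mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
        cov_x = np.cov(x, rowvar=False)
        cov_y = np.cov(y, rowvar=False)
        # sqrtm may return tiny imaginary parts from numerical error; drop them.
        cov_sqrt = sqrtm(cov_x @ cov_y)
        if np.iscomplexobj(cov_sqrt):
            cov_sqrt = cov_sqrt.real
        diff = mu_x - mu_y
        return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * cov_sqrt))

    # Hypothetical usage: embed_texts() stands in for any text-embedding model.
    # relevant_embs  = embed_texts(judged_relevant_docs)   # judged-relevant documents
    # retrieved_embs = embed_texts(retrieval_system_output)
    # generated_embs = embed_texts(generative_system_output)
    # A lower distance to the relevant set indicates output closer in embedding
    # space to relevant content, whether that output was retrieved or generated:
    # score_retrieval = frechet_distance(retrieved_embs, relevant_embs)
    # score_generation = frechet_distance(generated_embs, relevant_embs)

Because the comparison is made in a shared embedding space rather than through per-document relevance labels, such a distributional measure can, in principle, score retrieved result lists and generated answers on the same footing.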