Please note: This PhD defence will take place online.
Xinyu Crystina Zhang, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Jimmy Lin
Embedding models are a core component of modern information access systems, powering search engines, question answering, retrieval-augmented generation, and agentic search pipelines. By mapping text into vector spaces, they enable semantic matching beyond exact lexical overlap. However, progress in embedding-based retrieval has been highly uneven across languages. While English retrieval benefits from large-scale training data, mature benchmarks, and well-established modeling practices, many other languages still lack reliable resources, practical training guidance, and a clear understanding of how multilingual transfer works.
This thesis studies multilingual embedding models for retrieval from three connected perspectives: data, training, and understanding. First, it introduces two multilingual retrieval resources, Mr. TyDi and MIRACL, which provide large-scale benchmarks and supervision for evaluating and training retrieval models across diverse languages and scripts. Second, it investigates how to train effective multilingual dense retrievers under realistic resource conditions, studying the roles of pretrained backbones, translated and in-language data, multi-stage fine-tuning, cross-lingual transfer, knowledge distillation, and monolingual versus multilingual models. These experiments provide practical guidance for building retrieval systems when target-language resources are limited or unevenly distributed.
Finally, this thesis examines how multilingual language models understand and transfer across languages. Through analyses of shared tokens, multilingual vocabularies, and token-level semantic structures in embedding spaces, it studies the mechanisms that support or limit cross-lingual generalization. Overall, this work contributes datasets, training strategies, and model analyses that advance multilingual retrieval from infrastructure building to practical modeling and interpretation, with the broader goal of supporting more reliable information access across languages, scripts, and resource conditions.