Please note: This master’s thesis presentation will take place online.
Mofetoluwa Adeyemi, Master’s candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Jimmy Lin
Web resources are becoming more available in various languages, increasing the importance of cross-lingual information retrieval (CLIR) for accessing information written in a different language. To support CLIR studies, test collections are actively curated in the information retrieval (IR) field for the evaluation of methods and systems. Resources that support the evaluation of CLIR for African languages exist; however, they are few and are mostly curated synthetically or through translation, making them biased towards certain retrieval methods or prone to “translationese” issues. The document collections in current resources are also drawn from sources with scarce content in African languages, potentially limiting the availability of documents relevant to a search query. To address these limitations, we present CIRAL, a test collection covering retrieval between English and four African languages: Hausa, Somali, Swahili and Yoruba. With its corpora developed from African news and blogs, which are a rich source of textual data for these languages, CIRAL is formulated for the passage ranking task, with queries in English and passages in the African languages. Native speakers of the African languages developed the queries and provided the query-passage relevance assessments. As is often done in IR to curate test collections and promote research participation in CLIR, CIRAL was hosted as a shared task at the Forum for Information Retrieval Evaluation (FIRE) 2023, where pools were collected for a subset of the collection.
In this thesis, we provide a detailed description of CIRAL as a body of work, covering its curation process and shared task. Additionally, we conduct retrieval and reranking experiments, evaluating the effectiveness of systems in CLIR for African languages and demonstrating the utility of CIRAL. These experiments include BM25 baselines with query and document translation, and dense retrieval baselines with multilingual dense passage retrievers. We also examine the zero-shot reranking capabilities of T5 cross-encoder models and large language models (LLMs) such as GPT and Zephyr in CLIR for African languages. We hope CIRAL fosters CLIR evaluation and research in African languages, and hence the development of retrieval systems that are well suited to such tasks.