Master’s Thesis Presentation • Machine Learning • AfriBERTa: Towards Viable Multilingual Language Models for Low-resource Languages

Tuesday, August 16, 2022 — 11:00 AM to 12:00 PM EDT

Please note: This master’s thesis presentation will take place online.

Kelechi Ogueji, Master’s candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Jimmy Lin

There are over 7000 languages spoken on earth, but many of these languages suffer from a dearth of natural language processing (NLP) tools. Multilingual pretrained language models have been introduced to help alleviate this problem. However, even the largest pretrained multilingual models were trained on only hundreds of languages, a small fraction of the languages spoken worldwide. While these models have displayed impressive performance on several languages, including some they were not pretrained on, much ground remains to be covered.

Many languages are left out because pretrained language models are assumed to require large amounts of training data, which these languages lack. Furthermore, a major motivation behind these models is the assumption that lower-resource languages benefit from joint training with higher-resource languages. In this thesis, we challenge both of these assumptions and present the first attempt at training a multilingual language model on only low-resource languages. We show that it is possible to train competitive multilingual language models on less than one gigabyte of text data containing a selection of African languages.

Our model, named AfriBERTa, covers 11 African languages and is the first language model for 4 of them. We evaluate this model on named entity recognition and text classification spanning 10 languages. Our evaluation results show that our model outperforms larger multilingual models (multilingual BERT and XLM-RoBERTa) on several languages and is very competitive overall. The results suggest that our “small data” approach based on similar languages may sometimes work better than joint training on large datasets with high-resource languages. Furthermore, we present a comprehensive discussion of the implications of our findings.


To join this master’s thesis presentation on Zoom, please go to https://uwaterloo.zoom.us/j/98609377813?pwd=R1Awc0psL0FsMVlRczAwN2FpZHc2dz09.

Location
Online master’s thesis presentation
200 University Avenue West
Waterloo, ON N2L 3G1
Canada