Researchers at the Cheriton School of Computer Science have developed a data-efficient, pretrained transformer-based neural language model to analyze 11 African languages. Their new neural network model, which they have dubbed AfriBERTa, is based on BERT — Bidirectional Encoder Representations from Transformers — a deep learning technique for natural language processing developed in 2018 by Google.
“Pretrained language models have transformed the way computers process and analyze textual data for tasks ranging from machine translation to question answering,” said Kelechi Ogueji, a master’s student in computer science at Waterloo. “Sadly, African languages have received little attention from the research community.”
Most of these models work using a technique known as pretraining. To accomplish this, the researchers present the model with text in which some of the words have been covered up, or “masked.” The model’s job is to guess the masked words. By repeating the process many billions of times, using graphics processing units (GPUs) in large data centres, the model learns statistical associations between words that mimic human understanding of language.
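The masking step described above can be illustrated with a minimal sketch. This is not AfriBERTa’s actual training code — it is a toy illustration of the masked-language-model objective, where a fraction of tokens are hidden and the originals are kept as the targets the model must learn to predict:

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Hide roughly `mask_rate` of the tokens and record the originals.

    Returns (masked_tokens, labels): labels hold the hidden word at masked
    positions and None elsewhere, mirroring how a masked-language-model
    objective scores predictions only at masked positions.
    """
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            masked.append(mask_token)
            labels.append(tok)      # the model must recover this word
        else:
            masked.append(tok)
            labels.append(None)     # nothing to predict here
    return masked, labels

sentence = "the model learns statistical associations between words".split()
masked, labels = mask_tokens(sentence)
```

During pretraining, the model sees only `masked` and is scored on how well it recovers the words recorded in `labels`; repeating this over billions of sentences is what builds up the statistical knowledge of the language. (Real BERT-style training adds refinements such as subword tokenization and occasionally replacing a chosen token with a random word instead of `[MASK]`, which this sketch omits.)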
For resource-rich languages such as English, it’s easy to find lots of text to pretrain a BERT model. But this has been much more difficult for many other languages where comparable data are not as widely available. As a result, natural language processing capabilities have not been available to many parts of the world, including to large swaths of the African continent.
“One of the challenges is that neural networks are bewilderingly text- and compute-intensive to build,” Kelechi continued. “And unlike English, which has enormous quantities of available text, most of the 7,000 or so languages spoken worldwide can be characterized as low-resource, in that there is a lack of data available to feed data-hungry neural networks.”
Multilingual language models have tried to alleviate the data challenge by simultaneously pretraining on many languages, the conventional wisdom being that knowledge from high-resource languages can carry over to low-resource languages that share certain linguistic similarities. But these models still require lots of data. For example, one popular model called XLM-R is pretrained on 2.4 terabytes of data — more than 164 billion words of text. Even then, coverage of African languages has remained spotty.
AfriBERTa helps to fill this gap by enabling computers to analyze text of African languages for many useful tasks. The model specifically covers 11 languages, including Amharic, Hausa, and Swahili, spoken collectively by more than 400 million people. Importantly, AfriBERTa achieves output quality comparable to the best existing models despite learning from just one gigabyte of text, whereas other models require thousands of times more data. In particular, the researchers discovered that the transfer effects from high- to low-resource languages do not appear necessary to produce a good model.
“Being able to pretrain models that are just as accurate for certain downstream tasks, but with vastly smaller amounts of data, has many advantages,” said Jimmy Lin, Cheriton Chair in Computer Science and Kelechi’s advisor. “Needing less data to train the language model means that less computation is required, and consequently the carbon emissions associated with operating massive data centres are lower. Smaller datasets also make data curation more practical, which is one approach to reducing the biases present in the models.”
“This work takes a small but important step toward bringing natural language processing capabilities to the more than 1.3 billion people on the African continent,” Professor Lin concludes. “Our experiments also highlight important factors to consider when pretraining multilingual models on smaller datasets, presenting some guidance on model architecture that might be applicable to other low-resource languages as well.”
Assisting Kelechi Ogueji and Jimmy Lin in this research is Yuxin Zhu, who recently completed an undergraduate degree in computer science at Waterloo.
To learn more about the research on which this feature is based, please see Kelechi Ogueji, Yuxin Zhu, Jimmy Lin, “Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages.”
Their research is being presented at the First Workshop on Multilingual Representation Learning at EMNLP 2021, the 2021 Conference on Empirical Methods in Natural Language Processing. The AfriBERTa code, data and models are available at https://github.com/keleog/afriberta.