Journal of Universal Computer Science (Dec 2024)

Insights into Low-Resource Language Modelling: Improving Model Performances for South African Languages

  • Ruan Visser,
  • Trieko Grobler,
  • Marcel Dunaiski

DOI
https://doi.org/10.3897/jucs.118889
Journal volume & issue
Vol. 30, no. 13
pp. 1849 – 1871

Abstract


To address the gap in natural language processing for Southern African languages, our paper presents an in-depth analysis of language model development under resource-constrained conditions. We investigate the interplay between model size, pretraining objectives, and multilingual dataset composition in the context of low-resource languages such as Zulu and Xhosa. In our approach, we initially pretrain language models from scratch on specific low-resource languages using a variety of model configurations, and then incrementally add related languages to explore their effect on model performance. We demonstrate that smaller data volumes can be leveraged effectively, and that the choice of pretraining objective and multilingual dataset composition significantly influences model performance. Our monolingual and multilingual models exhibit competitive, and in some cases superior, performance compared to established multilingual models such as XLM-R-base and AfroXLM-R-base.
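The pretraining setup described above (training from scratch on one low-resource language, then growing the corpus with a related language) can be illustrated with a minimal sketch. This is not the authors' code: the corpus file names, the stand-in tokenizer, the model size, and all hyperparameters are illustrative assumptions, using standard Hugging Face components for a masked language modelling objective.

# Minimal sketch (assumed setup, not the paper's implementation): pretrain a small
# masked language model from scratch on a Zulu corpus, optionally extended with a
# related Xhosa corpus. File paths, model size, and hyperparameters are hypothetical.
from datasets import load_dataset, concatenate_datasets
from transformers import (
    RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# Hypothetical plain-text corpora for the target and related language.
zulu = load_dataset("text", data_files={"train": "zu_corpus.txt"})["train"]
xhosa = load_dataset("text", data_files={"train": "xh_corpus.txt"})["train"]

# Stand-in tokenizer; in practice one would train a tokenizer on the target corpus.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Monolingual setting: Zulu only; multilingual setting: Zulu plus related Xhosa.
train_data = concatenate_datasets([zulu, xhosa]).map(
    tokenize, batched=True, remove_columns=["text"]
)

# Small from-scratch configuration (illustrative sizes, not the paper's).
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    num_hidden_layers=6, hidden_size=384,
    num_attention_heads=6, intermediate_size=1536,
)
model = RobertaForMaskedLM(config)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="mlm-zu-xh",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=1e-4,
)
Trainer(model=model, args=args, train_dataset=train_data,
        data_collator=collator).train()

In this sketch, the incremental-language experiments reduce to varying which corpora are concatenated before training, while the model-size experiments reduce to varying the configuration parameters.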

Keywords