Pre-trained transformer-based language models for Sundanese

Wilson Wongso; Henry Lucky; Derwin Suhartono

doi:10.1186/s40537-022-00590-7

Journal of Big Data (Apr 2022)

Affiliations

Wilson Wongso: Computer Science Department, School of Computer Science, Bina Nusantara University
Henry Lucky: Computer Science Department, School of Computer Science, Bina Nusantara University
Derwin Suhartono: Computer Science Department, School of Computer Science, Bina Nusantara University

Journal volume & issue: Vol. 9, no. 1, pp. 1–17

Abstract

The Sundanese language has over 32 million speakers worldwide, but it has reaped little to no benefit from recent advances in natural language understanding. As with other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and did not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.

Published in Journal of Big Data

ISSN: 2196-1115 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journalofbigdata.springeropen.com
