PeerJ Computer Science (Jun 2023)

Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets

  • Ahmad Fathan Hidayatullah,
  • Rosyzie Anna Apong,
  • Daphne T.C. Lai,
  • Atika Qazi

DOI
https://doi.org/10.7717/peerj-cs.1312
Journal volume & issue
Vol. 9
p. e1312

Abstract

With the massive use of social media today, mixing languages in social media text has become prevalent. In linguistics, this phenomenon is known as code-mixing. Its prevalence raises various challenges for natural language processing (NLP), including language identification (LID). This study presents a word-level language identification model for code-mixed Indonesian, Javanese, and English tweets. First, we introduce a code-mixed corpus for Indonesian-Javanese-English language identification (IJELID). To ensure reliable dataset annotation, we give full details of the data collection procedure and of how the annotation standards were constructed. We also discuss some of the challenges encountered during corpus creation. Then, we investigate several approaches to building code-mixed language identification models: fine-tuning BERT, BLSTM-based models, and CRF. Our results show that fine-tuned IndoBERTweet models identify languages better than the other techniques, which we attribute to BERT's ability to capture each word's context within the given text sequence. Finally, we show that sub-word language representation in BERT models can provide a reliable model for identifying languages in code-mixed texts.
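
To make the interplay between word-level labels and sub-word representations concrete, the following is a minimal sketch of one common way to reconcile them: each word's language tag is assigned to its first sub-word piece, and a prediction for the word is read back from that first piece. Everything here is an assumption made for illustration rather than the authors' implementation: the tag names `JV`, `ID`, `EN`, the placeholder label `X`, and the fixed-size `toy_subword_split` function, which only stands in for a real sub-word tokenizer.

```python
from typing import Callable, List, Tuple

IGNORE = "X"  # hypothetical placeholder label for continuation pieces


def toy_subword_split(word: str, size: int = 4) -> List[str]:
    """Hypothetical stand-in for a sub-word tokenizer: fixed-size chunks."""
    chunks = [word[i:i + size] for i in range(0, len(word), size)] or [word]
    # Mark every piece after the first as a continuation, WordPiece-style.
    return [chunks[0]] + ["##" + c for c in chunks[1:]]


def align_labels(
    words: List[str],
    labels: List[str],
    split: Callable[[str], List[str]] = toy_subword_split,
) -> Tuple[List[str], List[str], List[int]]:
    """Expand word-level labels to sub-word pieces.

    Returns the pieces, one label per piece (the word's label on its first
    piece, IGNORE on the rest), and the index of the source word per piece.
    """
    pieces: List[str] = []
    piece_labels: List[str] = []
    word_ids: List[int] = []
    for idx, (word, label) in enumerate(zip(words, labels)):
        parts = split(word)
        pieces.extend(parts)
        piece_labels.extend([label] + [IGNORE] * (len(parts) - 1))
        word_ids.extend([idx] * len(parts))
    return pieces, piece_labels, word_ids


def word_level_predictions(
    piece_predictions: List[str], word_ids: List[int]
) -> List[str]:
    """Collapse per-piece predictions to one tag per word (first piece wins)."""
    seen = set()
    result: List[str] = []
    for pred, wid in zip(piece_predictions, word_ids):
        if wid not in seen:
            seen.add(wid)
            result.append(pred)
    return result


# Illustrative code-mixed input: Javanese, Indonesian, and English words.
words = ["kowe", "sedang", "meeting"]
labels = ["JV", "ID", "EN"]
pieces, piece_labels, word_ids = align_labels(words, labels)
```

With a real tokenizer the pieces would differ, but the bookkeeping stays the same: the model scores every piece using the context of the whole sequence, and only the first piece of each word determines that word's language tag.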

Keywords