Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets

Ahmad Fathan Hidayatullah; Rosyzie Anna Apong; Daphne T.C. Lai; Atika Qazi

doi:10.7717/peerj-cs.1312

PeerJ Computer Science (Jun 2023)

Affiliations

Ahmad Fathan Hidayatullah: School of Digital Science, Universiti Brunei Darussalam, Bandar Seri Begawan, Brunei
Rosyzie Anna Apong: School of Digital Science, Universiti Brunei Darussalam, Bandar Seri Begawan, Brunei
Daphne T.C. Lai: School of Digital Science, Universiti Brunei Darussalam, Bandar Seri Begawan, Brunei
Atika Qazi: Centre for Lifelong Learning, Universiti Brunei Darussalam, Bandar Seri Begawan, Brunei

DOI: https://doi.org/10.7717/peerj-cs.1312
Journal volume & issue: Vol. 9, p. e1312

Abstract

With the widespread use of social media today, language mixing in social media text is prevalent. In linguistics, this phenomenon is known as code-mixing. The prevalence of code-mixing poses various challenges for natural language processing (NLP), including language identification (LID) tasks. This study presents a word-level language identification model for code-mixed Indonesian, Javanese, and English tweets. First, we introduce a code-mixed corpus for Indonesian-Javanese-English language identification (IJELID). To ensure reliable dataset annotation, we provide full details of the data collection procedure and of how the annotation standards were constructed. Some challenges encountered during corpus creation are also discussed in this paper. Then, we investigate several strategies for developing code-mixed language identification models, including fine-tuned BERT models, BLSTM-based models, and conditional random fields (CRF). Our results show that fine-tuned IndoBERTweet models can identify languages better than the other techniques. This is due to BERT’s ability to capture each word’s context from the given text sequence. Finally, we show that sub-word language representation in BERT models can provide a reliable model for identifying languages in code-mixed texts.

Published in PeerJ Computer Science

ISSN: 2376-5992 (Online)
Publisher: PeerJ Inc.
Country of publisher: United States
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://peerj.com/computer-science/
