A Systematic Review on Language Identification of Code-Mixed Text: Techniques, Data Availability, Challenges, and Framework Development

Ahmad Fathan Hidayatullah; Atika Qazi; Daphne Teck Ching Lai; Rosyzie Anna Apong

doi:10.1109/ACCESS.2022.3223703

IEEE Access (Jan 2022)

Affiliations

Ahmad Fathan Hidayatullah: School of Digital Science, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
Atika Qazi: Centre for Lifelong Learning, Universiti Brunei Darussalam, Gadong BE, Brunei Darussalam
Daphne Teck Ching Lai: School of Digital Science, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
Rosyzie Anna Apong: School of Digital Science, Universiti Brunei Darussalam, Gadong, Brunei Darussalam

Journal volume & issue: Vol. 10
pp. 122812 – 122831

Abstract

Mixing a native language with other languages (code-mixing) on social media poses a severe challenge for language identification (LID) systems and has encouraged research on code-mixed LID solutions. This study examines four aspects of code-mixed LID: techniques, challenges, dataset availability with corresponding quality criteria, and the development of a comprehensive framework. We also identified gaps and future work opportunities in tackling code-mixed LID challenges and, based on our analysis of the reviewed studies, outlined key points for future research. We presented a taxonomy of the techniques applied to code-mixed LID and highlighted the different technique variants. We found four significant challenges in code-mixed LID tasks: ambiguity, lexical borrowing, non-standard words, and intra-word code-mixing. This systematic literature review identified 32 code-mixed datasets available for LID. We proposed five features to describe dataset quality: the number of instances or sentences, the percentage of code-mixed types in the data, the number of tokens, the number of unique tokens, and the average sentence length. Finally, through our literature analysis, we synthesised the methodologies and proposed a conceptual framework for subsequent studies.
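
The five dataset descriptors named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `describe_dataset`, the `(token, language_tag)` input format, the lowercasing of tokens, and the toy sentences are all assumptions made for illustration. In particular, "percentage of code-mixed types" is read here as the share of sentences containing more than one language tag, which may differ from the definition used in the paper.

```python
from dataclasses import dataclass


@dataclass
class DatasetStats:
    num_sentences: int          # number of instances or sentences
    pct_code_mixed: float       # assumed: % of sentences with more than one language tag
    num_tokens: int             # total token count
    num_unique_tokens: int      # vocabulary size (case-insensitive, an assumption)
    avg_sentence_length: float  # mean tokens per sentence


def describe_dataset(sentences):
    """Compute the five descriptors for a token-level LID dataset.

    `sentences` is a list of sentences, each a list of (token, language_tag)
    pairs. This input format is hypothetical, chosen only for the sketch.
    """
    num_sentences = len(sentences)
    tokens = [tok.lower() for sent in sentences for tok, _ in sent]
    # A sentence counts as code-mixed when its tokens carry more than one tag.
    mixed = sum(1 for sent in sentences if len({tag for _, tag in sent}) > 1)
    return DatasetStats(
        num_sentences=num_sentences,
        pct_code_mixed=100.0 * mixed / num_sentences if num_sentences else 0.0,
        num_tokens=len(tokens),
        num_unique_tokens=len(set(tokens)),
        avg_sentence_length=len(tokens) / num_sentences if num_sentences else 0.0,
    )


# Toy data invented for this example (Indonesian-English), not from any reviewed dataset.
toy = [
    [("saya", "id"), ("suka", "id"), ("this", "en"), ("song", "en")],
    [("good", "en"), ("morning", "en")],
]
stats = describe_dataset(toy)
print(stats)
```

Reporting these values side by side makes datasets of different sizes and mixing densities easier to compare, which is the purpose the abstract gives for the quality criteria.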

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639