IEEE Access (Jan 2022)

A Systematic Review on Language Identification of Code-Mixed Text: Techniques, Data Availability, Challenges, and Framework Development

  • Ahmad Fathan Hidayatullah,
  • Atika Qazi,
  • Daphne Teck Ching Lai,
  • Rosyzie Anna Apong

DOI
https://doi.org/10.1109/ACCESS.2022.3223703
Journal volume & issue
Vol. 10
pp. 122812 – 122831

Abstract

Read online

The mix of native language with other languages (code-mixing) in social media has posed a severe challenge for language identification (LID) systems. It has encouraged research on code-mixed LID solutions. Four things have been identified in this study, such as techniques, challenges, and dataset availability with corresponding quality criteria and developed a comprehensive framework for code-mixed LID. Also, we identified gaps and future work opportunities in tackling code-mixed LID challenges. Based on our analysis of reviewed studies, we outlined key points for future research in code-mixed LID. We demonstrated a taxonomy of applied techniques for code-mixed LID and highlighted the different technique variants. In code-mixed LID tasks, we discovered four significant challenges: ambiguity, lexical borrowing, non-standard words, and intra-word code-mixing. This systematic literature review recognised 32 code-mixed datasets available for LID. We proposed five features to describe the quality criteria datasets, such as the number of instances or sentences, percentage of code-mixed types in the data, number of tokens, number of unique tokens, and average sentence length. Finally, we synthesised the methodologies and proposed a conceptual framework for subsequent studies through our literature analysis.

Keywords