Bridging the Kuwaiti Dialect Gap in Natural Language Processing

Fatemah Husain; Hana Alostad; Halima Omar

doi:10.1109/ACCESS.2024.3364367

IEEE Access (Jan 2024)

Affiliations

Fatemah Husain: Information Science Department, College of Life Sciences, Sabah AlSalem University City (Alshadadiya), Kuwait University, Safat, Kuwait
Hana Alostad: Computer Science Department, College of Arts and Sciences, Gulf University for Science and Technology, Hawally, Kuwait
Halima Omar: Communication Disorders Science Department, College of Life Sciences, Sabah AlSalem University City (Alshadadiya), Kuwait University, Safat, Kuwait

Journal volume & issue: Vol. 12, pp. 27709–27722

Abstract

Existing dialectal Arabic linguistic resources cover few Arabic dialects, and the Kuwaiti dialect is particularly underserved. This shortage hampers researchers in the Natural Language Processing (NLP) field and limits the development of advanced linguistic analysis and processing tools for the Kuwaiti dialect. Many other low-resource Arabic dialects also remain unexplored in research because of the difficulty of recruiting annotators for dataset labeling. This paper proposes a weakly supervised classification system, called “q8SentiLabeler”, to address the problem of recruiting human annotators. In addition, we developed a large dataset of over 16.6k posts for sentiment analysis in the Kuwaiti dialect. The dataset covers several themes and timeframes to limit bias that might affect its content. Furthermore, we evaluated the dataset using multiple traditional machine-learning classifiers and advanced deep-learning language models. Results demonstrate the potential of “q8SentiLabeler” to replace human annotators, with a pairwise percent agreement of 93% and a Cohen’s Kappa coefficient of 0.87. Using the ARBERT model on our dataset, we achieved 89% accuracy.
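
The two agreement measures named in the abstract can be illustrated with a minimal sketch. The function names and the toy label lists below are assumptions made for illustration only; they are not the paper's data or code, and the paper's own computation may differ in detail. Cohen's Kappa corrects the observed agreement p_o for the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e).

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assign the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_o = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: sum over labels of the product of each
    # annotator's marginal probability of choosing that label.
    p_e = sum(
        counts_a[label] * counts_b[label]
        for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels (not taken from the paper's dataset).
human = ["pos", "pos", "neg", "neg", "neu", "neu", "pos", "neg"]
auto = ["pos", "pos", "neg", "neu", "neu", "neu", "pos", "neg"]

print(f"percent agreement: {percent_agreement(human, auto):.3f}")
print(f"Cohen's kappa:     {cohens_kappa(human, auto):.3f}")
```

Percent agreement alone can look high when one label dominates, which is why a chance-corrected measure such as Kappa is usually reported alongside it.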

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
