Toward Robust Arabic AI-Generated Text Detection: Tackling Diacritics Challenges

Hamed Alshammari; Khaled Elleithy

doi:10.3390/info15070419

Information (Jul 2024)

Affiliations

Hamed Alshammari: Department of Computer Science and Engineering, University of Bridgeport, Bridgeport, CT 06604, USA
Khaled Elleithy: Department of Computer Science and Engineering, University of Bridgeport, Bridgeport, CT 06604, USA

DOI: https://doi.org/10.3390/info15070419
Journal volume & issue: Vol. 15, no. 7, p. 419

Abstract

Current AI detection systems often struggle to distinguish Arabic human-written text (HWT) from AI-generated text (AIGT) because of diacritics, the small marks written above and below Arabic letters. This study introduces robust Arabic text detection models built on Transformer-based pre-trained models, specifically AraELECTRA, AraBERT, XLM-R, and mBERT. Our primary goal is to detect AIGTs in essays and to overcome the challenges posed by diacritics, which usually appear in Arabic religious texts. We created several novel datasets of diacritized and non-diacritized texts comprising up to 9666 HWT and AIGT training examples. We evaluated the detection models on out-of-domain (OOD) datasets to assess their robustness, effectiveness, and generalizability. Our detection models trained on diacritized examples achieved up to 98.4% accuracy, compared to GPTZero’s 62.7%, on the AIRABIC benchmark dataset. Our experiments reveal that, while including diacritics in training enhances the recognition of diacritized HWTs, duplicating examples with and without diacritics is inefficient despite the high accuracy achieved. Applying a dediacritization filter during evaluation significantly improved model performance, yielding the best results compared to both GPTZero and the detection models trained on diacritized examples but evaluated without dediacritization. Although our focus was on Arabic due to its writing challenges, our detector architecture is adaptable to any language.
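The dediacritization filter mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the authors implemented it, so the following is an assumption: it strips the eight core Arabic diacritic marks (U+064B to U+0652) and the superscript alef (U+0670) with a regular expression before text would be passed to a detector. The function name `dediacritize` and the exact set of marks are illustrative choices, not taken from the paper.

```python
import re

# Assumed mark set (not from the paper): the eight core tashkeel signs
# U+064B..U+0652 (fathatan .. sukun) plus the superscript alef U+0670.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def dediacritize(text: str) -> str:
    """Return `text` with Arabic diacritic marks removed.

    Base letters, digits, punctuation, and non-Arabic characters
    are left unchanged.
    """
    return _DIACRITICS.sub("", text)


# Hypothetical sample: the three letters kaf, teh, beh, each followed by a fatha.
sample = "\u0643\u064E\u062A\u064E\u0628\u064E"
print(dediacritize(sample))
```

Applying such a filter at evaluation time means a detector sees the same surface form for a sentence whether or not it was originally diacritized, which is the effect the abstract attributes to dediacritization.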

Published in Information

ISSN: 2078-2489 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://www.mdpi.com/journal/information/
