AI-Generated Text Detector for Arabic Language Using Encoder-Based Transformer Architecture

Hamed Alshammari; Ahmed El-Sayed; Khaled Elleithy

doi:10.3390/bdcc8030032

Big Data and Cognitive Computing (Mar 2024)

Affiliations

Hamed Alshammari: Department of Computer Science and Engineering, University of Bridgeport, Bridgeport, CT 06604, USA
Ahmed El-Sayed: Department of Electrical and Computer Engineering, University of Bridgeport, Bridgeport, CT 06604, USA
Khaled Elleithy: Department of Computer Science and Engineering, University of Bridgeport, Bridgeport, CT 06604, USA

DOI: https://doi.org/10.3390/bdcc8030032
Journal volume & issue: Vol. 8, no. 3, p. 32

Abstract

Existing AI detectors perform notably worse on Arabic texts. This study introduces a novel AI text classifier designed specifically for Arabic, addressing the distinct challenges of processing this language. A particular focus is placed on accurately recognizing human-written texts (HWTs), an area where existing AI detectors have demonstrated significant limitations. To achieve this goal, the study fine-tuned two Transformer-based models, AraELECTRA and XLM-R, on two distinct datasets: a large dataset comprising 43,958 examples and a custom dataset with 3,078 examples that contain HWTs and AI-generated texts (AIGTs) from various sources, including ChatGPT 3.5, ChatGPT-4, and BARD. The proposed architecture is adaptable to any language, but this work evaluates the models' effectiveness in distinguishing HWTs from AIGTs in Arabic as an example of a Semitic language. The proposed models were compared against two prominent existing AI detectors, GPTZero and OpenAI Text Classifier, on the AIRABIC benchmark dataset. The results show that the proposed classifiers outperform both, reaching 81% accuracy compared with 63% for GPTZero and 50% for OpenAI Text Classifier. Furthermore, adding a Dediacritization Layer before the classification model significantly improved the detection accuracy for both HWTs and AIGTs, raising it from 81% to as high as 99% and, in some instances, to 100%.
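
The abstract does not say how the Dediacritization Layer is implemented. As a minimal sketch of the general idea, the Python below strips Arabic diacritical marks from a text before handing it to a classifier. The names `dediacritize` and `classify_with_dediacritization` are hypothetical, the set of code points treated as diacritics (U+064B to U+0652 plus U+0670) is an assumption, and `classifier` stands in for any callable such as a fine-tuned model wrapper; none of these come from the paper.

```python
import re

# Assumption: "diacritics" here means the common Arabic tashkeel marks
# (fathatan U+064B through sukun U+0652) plus the superscript alef U+0670.
# The paper may remove a different set of characters.
_ARABIC_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def dediacritize(text: str) -> str:
    """Return the text with the assumed Arabic diacritical marks removed."""
    return _ARABIC_DIACRITICS.sub("", text)


def classify_with_dediacritization(text, classifier):
    """Hypothetical wrapper: strip diacritics, then call the classifier.

    `classifier` is any callable that maps a string to a label; it is a
    placeholder for a fine-tuned model, not the authors' implementation.
    """
    return classifier(dediacritize(text))
```

Placing the normalization in a wrapper keeps it separate from the model, so the same classifier can be evaluated with and without the step.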

Published in Big Data and Cognitive Computing

ISSN: 2504-2289 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology
Website: http://www.mdpi.com/journal/BDCC
