Simplification of Arabic text: A hybrid approach integrating machine translation and transformer-based lexical model

Suha S. Al-Thanyyan; Aqil M. Azmi

Journal of King Saud University: Computer and Information Sciences (Sep 2023)


Affiliations

Suha S. Al-Thanyyan: Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
Aqil M. Azmi (corresponding author): Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia

Journal volume & issue: Vol. 35, no. 8, p. 101662

Abstract


The process of text simplification (TS) is crucial for enhancing the comprehension of written material, especially for people with low literacy levels and those who struggle to understand written content. In this study, we introduce the first automated approach to TS that combines word-level and sentence-level simplification techniques for Arabic text. We employ three models: a neural machine translation model, an Arabic-BERT-based lexical model, and a hybrid model that combines both methods to simplify the text. To evaluate the models, we created and utilized two Arabic datasets, namely EW-SEW and WikiLarge, comprising 82,585 and 249 sentence pairs, respectively. As resources were scarce, we made these datasets available to other researchers. The EW-SEW dataset is a commonly used English TS corpus that aligns each sentence in the original English Wikipedia (EW) with a simpler reference sentence from Simple English Wikipedia (SEW). In contrast, the WikiLarge dataset has eight simplified reference sentences for each of the 249 test sentences. The hybrid model outperformed the other models, achieving a BLEU score of 55.68, a SARI score of 37.15, and an FBERT score of 86.7% on the WikiLarge dataset, demonstrating the effectiveness of the combined approach.

Published in Journal of King Saud University: Computer and Information Sciences

ISSN: 1319-1578 (Print); 2213-1248 (Online)
Publisher: Springer
Country of publisher: Switzerland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://link.springer.com/journal/44443
