Synthetic data with neural machine translation for automatic correction in Arabic grammar

Aiman Solyman; Wang Zhenyu; Tao Qian; Arafat Abdulgader Mohammed Elhag; Muhammad Toseef; Zeinab Aleibeid

Egyptian Informatics Journal (Sep 2021)

Affiliations

Aiman Solyman: School of Software Engineering, South China University of Technology, Guangzhou, China
Wang Zhenyu: School of Software Engineering, South China University of Technology, Guangzhou, China; Corresponding author.
Tao Qian: School of Software Engineering, South China University of Technology, Guangzhou, China
Arafat Abdulgader Mohammed Elhag: Department of Information System, Bisha Community College, University of Bisha, Saudi Arabia
Muhammad Toseef: School of Software Engineering, South China University of Technology, Guangzhou, China
Zeinab Aleibeid: School of Computer Science, Wuhan University of Technology, Wuhan, China

Journal volume & issue: Vol. 22, no. 3, pp. 303–315

Abstract

The automatic correction of grammar and spelling errors is important for students, second-language learners, and Natural Language Processing (NLP) tasks such as part-of-speech tagging and text summarization. Recently, Neural Machine Translation (NMT) has become a well-established, high-performing approach to Grammar Error Correction (GEC). Arabic GEC (AGEC) is still developing because of challenges such as the scarcity of training data and the complexity of the Arabic language. To overcome these issues, we introduce an unsupervised method that generates large-scale synthetic training data using a confusion function, increasing the size of the training set. Furthermore, we introduce a supervised NMT model for AGEC called SCUT AGEC, a convolutional sequence-to-sequence model consisting of nine encoder-decoder layers with an attention mechanism. We apply fine-tuning to improve performance and obtain more efficient results. Convolutional Neural Networks (CNNs) let our model perform feature extraction and classification jointly in one task, and we show that they are an efficient way to capture features of the local context. Moreover, stacking convolutional layers makes it easy to capture long-term dependencies. Our proposed model is the first supervised AGEC system based on convolutional sequence-to-sequence learning to outperform the current state-of-the-art neural AGEC models.
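The abstract does not spell out the confusion function, so the following is only a minimal sketch of the general idea: corrupt clean sentences with random character-level errors to obtain (erroneous source, correct target) pairs for training. The names `CONFUSION_SETS`, `confuse_word`, and `make_pair`, the three corruption operations, the letter pairs, and the `error_rate` parameter are all assumptions made for illustration, not the authors' method.

```python
import random

# Hypothetical confusion sets: letters that writers often mix up.
# These few Arabic pairs are illustrative only, not the paper's sets.
CONFUSION_SETS = {
    "ا": ["أ", "إ", "آ"],
    "أ": ["ا", "إ"],
    "إ": ["ا", "أ"],
    "ة": ["ه"],
    "ه": ["ة"],
    "ى": ["ي"],
    "ي": ["ى"],
}


def confuse_word(word, rng):
    """Apply at most one random character-level corruption to a word."""
    op = rng.choice(["substitute", "delete", "swap"])
    chars = list(word)
    if op == "substitute":
        # Replace one confusable letter with a member of its confusion set.
        positions = [i for i, c in enumerate(chars) if c in CONFUSION_SETS]
        if positions:
            i = rng.choice(positions)
            chars[i] = rng.choice(CONFUSION_SETS[chars[i]])
    elif op == "delete" and len(chars) > 1:
        # Drop one character, never emptying the word.
        del chars[rng.randrange(len(chars))]
    elif op == "swap" and len(chars) > 1:
        # Transpose two adjacent characters.
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def make_pair(sentence, error_rate=0.3, seed=None):
    """Return (noisy_source, clean_target) for one clean sentence."""
    rng = random.Random(seed)
    words = sentence.split()
    noisy = [
        confuse_word(w, rng) if rng.random() < error_rate else w
        for w in words
    ]
    return " ".join(noisy), " ".join(words)


if __name__ == "__main__":
    source, target = make_pair("ذهب الولد إلى المدرسة", error_rate=0.5, seed=1)
    print(source)
    print(target)
```

Running `make_pair` over a large monolingual corpus would yield parallel data without manual annotation, which is the sense in which such a procedure is unsupervised. A realistic generator would likely also cover word-level errors (missing or extra words, wrong punctuation), which this sketch omits.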

Published in Egyptian Informatics Journal

ISSN: 1110-8665 (Print)
Publisher: Elsevier
Country of publisher: Netherlands
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.sciencedirect.com/journal/egyptian-informatics-journal
