A DEEP AUTOENCODER-BASED REPRESENTATION FOR ARABIC TEXT CATEGORIZATION

Fatima-Zahra El-Alami; Abdelkader El Mahdaouy; Said Ouatik El Alaoui; Noureddine En-Nahnahi

doi:10.32890/jict2020.19.3.4

Journal of ICT (Jun 2020)

Affiliations

Fatima-Zahra El-Alami: Laboratory of Informatics and Modeling, FSDM, Sidi Mohamed Ben Abdellah University, Morocco
Abdelkader El Mahdaouy: Laboratory of Informatics and Modeling, FSDM, Sidi Mohamed Ben Abdellah University, Morocco
Said Ouatik El Alaoui: Laboratory of Informatics and Modeling, FSDM, Sidi Mohamed Ben Abdellah University, Morocco & National School of Applied Sciences, Ibn Tofail University, Morocco
Noureddine En-Nahnahi: Laboratory of Informatics and Modeling, FSDM, Sidi Mohamed Ben Abdellah University, Morocco

DOI: https://doi.org/10.32890/jict2020.19.3.4
Journal volume & issue: Vol. 19, no. 3
pp. 381 – 398

Abstract

Arabic text representation is a challenging task for applications such as text categorization and clustering, since the Arabic language is known for its variety, richness, and complex morphology. To date, Bag-of-Words remains the most common method for Arabic text representation. However, it suffers from several shortcomings, such as a lack of semantics and a high-dimensional feature space. Moreover, most existing methods ignore the explicit knowledge contained in semantic vocabularies such as Arabic WordNet. To overcome these shortcomings, we proposed a deep Autoencoder-based representation for Arabic text categorization. It consisted of three stages: (1) extracting the most relevant concepts from Arabic WordNet through feature selection; (2) learning features for text representation with an unsupervised algorithm; and (3) categorizing text using a deep Autoencoder. Our method accounted for document semantics by combining implicit and explicit semantics, and it reduced the dimensionality of the feature space. To evaluate our method, we conducted several experiments on the standard Arabic dataset OSAC. The results showed that the proposed method was effective compared with state-of-the-art methods.
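As a rough illustration of the dimensionality-reduction idea described in the abstract, the sketch below trains a very small autoencoder on a toy document-feature matrix and keeps the hidden activations as a compact document representation. It is not the authors' implementation. The toy matrix `X`, the single hidden layer (the abstract refers to a deep Autoencoder), the layer size, the learning rate, and the function names `train_autoencoder` and `encode` are all assumptions made for illustration. Concept extraction from Arabic WordNet and the final categorization stage are not shown.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_autoencoder(X, hidden_dim, epochs=2000, lr=0.5, seed=0):
    """Train a one-hidden-layer autoencoder (sigmoid encoder, linear decoder)
    with full-batch gradient descent on the mean squared reconstruction error.
    Hypothetical helper: a simplification of a deep (multi-layer) autoencoder."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden_dim))  # encoder weights
    b1 = np.zeros(hidden_dim)
    W2 = rng.normal(0.0, 0.1, (hidden_dim, d))  # decoder weights
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)        # compressed codes, shape (n, hidden_dim)
        X_hat = H @ W2 + b2             # reconstruction, shape (n, d)
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared error.
        g_out = 2.0 * err / (n * d)
        gW2 = H.T @ g_out
        gb2 = g_out.sum(axis=0)
        g_hidden = (g_out @ W2.T) * H * (1.0 - H)
        gW1 = X.T @ g_hidden
        gb1 = g_hidden.sum(axis=0)
        W1 -= lr * gW1
        b1 -= lr * gb1
        W2 -= lr * gW2
        b2 -= lr * gb2
    return (W1, b1), losses


def encode(X, encoder_params):
    """Map documents to their low-dimensional codes."""
    W1, b1 = encoder_params
    return sigmoid(X @ W1 + b1)


# Toy data (assumed): 6 documents described by 8 binary features,
# standing in for selected terms or concepts.
X = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1, 0, 1, 1],
], dtype=float)

encoder_params, losses = train_autoencoder(X, hidden_dim=3)
codes = encode(X, encoder_params)  # 8 features reduced to 3 per document
```

In a pipeline like the one the abstract outlines, codes of this kind would be passed to a classifier in place of the raw high-dimensional feature vectors. How the paper combines the WordNet concepts with the learned features is not specified here, so that step is left out.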

Published in Journal of ICT

ISSN: 1675-414X (Print); 2180-3862 (Online)
Publisher: UUM Press
Country of publisher: Malaysia
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://jict.uum.edu.my/
