ARABIC TEXT CLASSIFICATION USING NEW STEMMER FOR FEATURE SELECTION AND DECISION TREES

SAID BAHASSINE; ABDELLAH MADANI; MOHAMED KISSI

Journal of Engineering Science and Technology (Jun 2017)

Affiliations

SAID BAHASSINE: LIMA Laboratory, Department of Computer Science, Chouaib Doukkali University, Faculty of Science, B.P. 20, 24000, El Jadida, Morocco
ABDELLAH MADANI: LAROSERI Laboratory, Department of Computer Science, Chouaib Doukkali University, Faculty of Science, B.P. 20, 24000, El Jadida, Morocco
MOHAMED KISSI: LIM Laboratory, Department of Computer Science, HASSAN II University Casablanca, Faculty of Sciences and Technologies, B.P. 146, 20650, Mohammedia, Morocco

Journal volume & issue: Vol. 12, no. 6
pp. 1475 – 1487

Abstract

Text classification is the process of assigning unclassified texts to appropriate classes based on their content. The most prevalent representation for text classification is the bag-of-words vector. In this representation, the words that appear in documents often have multiple morphological and grammatical forms, and in most cases these morphological variants of a word belong to the same category. In the first part of this paper, a new stemming algorithm is developed in which each term of a given document is represented by its root. In the second part, a comparative study is conducted of the impact of two stemming algorithms, namely Khoja’s stemmer and our new stemmer (referred to hereafter as origin-stemmer), on Arabic text classification. This investigation was carried out using chi-square as a feature selection method to reduce the dimensionality of the feature space, together with a decision tree classifier. To evaluate the performance of the classifier, this study used a corpus of 5070 documents independently classified into six categories: sport, entertainment, business, Middle East, switch and world. The experiments were run with the WEKA toolkit. Recall, precision and F-measure are used to compare the performance of the obtained models. The experimental results show that text classification using the new root stemmer (origin-stemmer) outperforms classification using Khoja’s stemmer. The F-measure was 92.9% in the sport category and 89.1% in the business category.
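
As an illustration of the chi-square feature selection step mentioned in the abstract, the following is a minimal sketch and not the authors' implementation (the study itself used WEKA). It assumes the common 2x2 contingency formulation of the chi-square score between a term and a category, and assumes a term's overall score is its maximum over categories. The function names, the toy documents and the transliterated tokens are all hypothetical.

```python
def chi_square(docs, labels, term, category):
    """Chi-square score of `term` for `category` from a 2x2 contingency table.

    docs   : list of sets of (already stemmed) tokens, one set per document
    labels : list of category names, aligned with docs
    """
    n = len(docs)
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document in another category
        elif in_cat:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document in another category
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score over all categories.

    Assumption: a term's score is its maximum over categories; ties are
    broken alphabetically so the result is deterministic.
    """
    vocab = sorted({t for doc in docs for t in doc})
    cats = sorted(set(labels))
    scores = {t: max(chi_square(docs, labels, t, c) for c in cats) for t in vocab}
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus: each document is the set of its stemmed tokens.
docs = [
    {"goal", "team"},
    {"team", "match"},
    {"bank", "market"},
    {"market", "price"},
]
labels = ["sport", "sport", "business", "business"]

print(chi_square(docs, labels, "team", "sport"))
print(select_features(docs, labels, 2))
```

In this toy corpus, terms that occur in every document of one category and in none of the other receive the highest score, so they are the ones kept for the classifier; the reduced term set would then serve as the attribute set for a decision tree.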

Published in Journal of Engineering Science and Technology

ISSN: 1823-4690 (Print)
Publisher: Taylor's University
Country of publisher: Malaysia
LCC subjects: Technology: Engineering (General). Civil engineering (General); Technology: Technology (General)
Website: http://jestec.taylors.edu.my/
