Effect of stemming on text similarity for Arabic language at sentence level

Mohammad O. Alhawarat; Hikmat Abdeljaber; Anwer Hilal

doi:10.7717/peerj-cs.530

PeerJ Computer Science (May 2021)

Affiliations

Mohammad O. Alhawarat: Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Alkharj, Saudi Arabia
Hikmat Abdeljaber: Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Alkharj, Saudi Arabia
Anwer Hilal: General Department, College of Preparatory Year, Prince Sattam Bin Abdulaziz University, Alkharj, Saudi Arabia

DOI: https://doi.org/10.7717/peerj-cs.530
Journal volume & issue: Vol. 7, p. e530

Abstract

Semantic Text Similarity (STS) has several important applications in the field of Natural Language Processing (NLP). The aim of this study is to investigate the effect of stemming on text similarity for the Arabic language at the sentence level. Ten algorithms are used in this study, comprising several Arabic light and heavy stemmers as well as lemmatization algorithms. Standard training and testing data sets are taken from the SemEval-2017 international workshop, Task 1, Track 1 Arabic (ar–ar). Different features are selected to study the effect of stemming on text similarity based on different similarity measures. Traditional machine learning algorithms are used, such as Support Vector Machines (SVM), Stochastic Gradient Descent (SGD) and Naïve Bayes (NB). Compared to the original text, using the stemmed and lemmatized documents in experiments achieves enhanced Pearson correlation results. The best results were attained when using the Arabic Light Stemmer (ARLSTem) and Farasa light stemmers, the Farasa and Qalsadi lemmatizers, and the Tashaphyne heavy stemmer. The best enhancement was about 7.34% in Pearson correlation. In general, stemming considerably improves the performance of sentence-level text similarity for the Arabic language. However, two stemmers make results worse than those for the original text: the Khoja heavy stemmer and the AlKhalil light stemmer.
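
The evaluation idea described in the abstract (normalize words, compute a sentence similarity feature, then score predictions against gold labels with Pearson correlation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_light_stem` and its affix lists are hypothetical and are not any of the ten algorithms studied, Jaccard overlap stands in for the unspecified similarity measures, and no machine learning model or SemEval data is involved.

```python
import math

# Hypothetical affix lists for illustration only; these do NOT reproduce
# ARLSTem, Farasa, Tashaphyne, or any other stemmer named in the abstract.
PREFIXES = ("ال", "و")
SUFFIXES = ("ات", "ون", "ين", "ة")


def toy_light_stem(token):
    """Strip at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= 3:
            token = token[len(prefix):]
            break
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            token = token[:-len(suffix)]
            break
    return token


def jaccard(sentence_a, sentence_b, stem=None):
    """Jaccard overlap of whitespace tokens, optionally stemmed first."""
    tokens_a = sentence_a.split()
    tokens_b = sentence_b.split()
    if stem is not None:
        tokens_a = [stem(t) for t in tokens_a]
        tokens_b = [stem(t) for t in tokens_b]
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # two empty sentences are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


if __name__ == "__main__":
    a, b = "الكتاب جديد", "كتاب جديد"
    print("raw similarity:    ", jaccard(a, b))
    print("stemmed similarity:", jaccard(a, b, stem=toy_light_stem))
```

In this toy pair, removing the definite article lets the two sentences share every token, which is the kind of effect that stemming is expected to have on overlap-based features. Comparing `pearson(predicted, gold)` with and without stemming over a labeled test set would mirror the comparison reported in the abstract, although the figures there come from the authors' own features and classifiers.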

Published in PeerJ Computer Science

ISSN: 2376-5992 (Online)
Publisher: PeerJ Inc.
Country of publisher: United States
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://peerj.com/computer-science/
