Engineering Reports (Jan 2021)

Sentiment classification of skewed shoppers' reviews using machine learning techniques, examining the textual features

  • Mahdi Rezapour

DOI
https://doi.org/10.1002/eng2.12280
Journal volume & issue
Vol. 3, no. 1

Abstract

With the rapid growth of online shopping, it has become crucial for product makers to analyze and handle a wealth of product reviews. However, the high volume of reviews, along with the wide variety of opinions they express, makes it hard for manufacturers to know exactly how to improve their products without an efficient approach. The results of sentiment classification help customers retrieve the information they need to choose an appropriate product, and help sellers collect customer feedback effectively in order to improve their products. Like most real-world data, the shopping review data used in this study were imbalanced, being predominantly composed of positive reviews with only a small percentage of negative reviews. Machine learning (ML) algorithms do not perform well when data are imbalanced, as they tend to become biased toward the overrepresented class. The synthetic minority over-sampling technique (SMOTE) was used to address this class imbalance problem. In this study, three ML-based algorithms were employed: Naïve Bayes (NB), support vector machine (SVM), and decision tree (DT). An extensive preprocessing procedure was applied to prepare the text datasets, and the details are discussed in the manuscript. The performance analysis indicated that the DT algorithm outperformed the other two methods. Because positive reviews account for the majority of the reviews, removing sparse words from the data resulted in the removal of almost all of the negative reviews' sentiments. Hence, the model training process was performed on positive and negative reviews separately. Other steps considered to improve model performance included combining the review titles with their contents, a separate tokenization process, the application of various N-grams, and retaining stop words (e.g., “not” or “but”).
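Two of the steps named in the abstract can be illustrated with a minimal, standard-library Python sketch: N-gram extraction that keeps negation stop words, and SMOTE-style oversampling of the minority class. This is not the author's implementation, and the abstract does not state which software was used. The stop-word list, the function names (tokenize, ngrams, smote), and the toy feature vectors are all hypothetical, and the oversampling assumes at least two minority samples.

```python
import random
import re

# Hypothetical stop-word list for illustration only. "not" and "but" are
# deliberately left out so that negation and contrast survive filtering.
STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and", "of", "to", "was"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop listed stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def ngrams(tokens, n):
    """Return all contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Bigrams keep "not good" together because "not" is never filtered out.
tokens = tokenize("This is not a good product")
print(ngrams(tokens, 2))

# Toy numeric feature vectors standing in for negative-review features.
negative = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(smote(negative, n_new=3))
```

Keeping “not” in the vocabulary means the bigram “not good” remains a distinct feature instead of collapsing into “good”. Each synthetic point lies on a line segment between two existing minority samples, so the minority class grows without exact duplication.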

Keywords