A MapReduce Opinion Mining for COVID-19-Related Tweets Classification Using Enhanced ID3 Decision Tree Classifier

Fatima Es-Sabery; Khadija Es-Sabery; Junaid Qadir; Beatriz Sainz-De-Abajo; Abdellatif Hair; Begona Garcia-Zapirain; Isabel De La Torre-Diez

doi:10.1109/ACCESS.2021.3073215

IEEE Access (Jan 2021)

Affiliations

Fatima Es-Sabery: Department of Computer Science, Faculty of Sciences and Technology, Sultan Moulay Slimane University, Beni Mellal, Morocco
Khadija Es-Sabery: Department of Computer Science, National School of Applied Sciences, Cadi Ayyad University, Marrakech, Morocco
Junaid Qadir: Department of Electronics, Quaid-i-Azam University, Islamabad, Pakistan
Beatriz Sainz-De-Abajo: Department of Signal Theory, Communications and Telematics Engineering, University of Valladolid, Valladolid, Spain
Abdellatif Hair: Department of Computer Science, Faculty of Sciences and Technology, Sultan Moulay Slimane University, Beni Mellal, Morocco
Begona Garcia-Zapirain: eVIDA Research Group, University of Deusto, Bilbao, Spain
Isabel De La Torre-Diez: Department of Signal Theory, Communications and Telematics Engineering, University of Valladolid, Valladolid, Spain

DOI: https://doi.org/10.1109/ACCESS.2021.3073215
Journal volume & issue: Vol. 9
pp. 58706 – 58739

Abstract

Opinion Mining (OM) is a field of Natural Language Processing (NLP) that aims to capture human sentiment in a given text. With the continuing spread of online purchasing websites, micro-blogging sites, and social media platforms, OM on online social media has attracted the interest of thousands of researchers, because the reviews, tweets, and blogs acquired from these networks are a significant source for enhancing the decision-making process. The obtained textual data (reviews, tweets, or blogs) are classified into three class labels (negative, neutral, and positive) in order to analyze and extract relevant information from the given dataset. In this contribution, we introduce an innovative MapReduce-based improved weighted ID3 decision tree classification approach for OM, which consists mainly of three aspects. First, we use several feature extractors to efficiently detect and capture the relevant data from the given tweets, including N-grams or character-level features, Bag-of-Words, word embeddings (GloVe, Word2Vec), FastText, and TF-IDF. Second, we apply several feature selectors to reduce the high dimensionality of the features, including Chi-square, Gain Ratio, Information Gain, and Gini Index. Finally, we use the obtained features to carry out the classification task with an improved ID3 decision tree classifier, which computes a weighted information gain instead of the information gain used in traditional ID3. To measure the weighted information gain of the current conditioned feature, we follow two steps: first, we compute the weighted correlation function of the current conditioned feature; second, we multiply the obtained weighted correlation function by the information gain of this feature. This work is implemented in a distributed environment using the Hadoop framework, with its programming model MapReduce and its distributed file system HDFS.
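The two-step weighted information gain described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not define the weighted correlation function, so `purity_weight` below is a hypothetical stand-in (the frequency-weighted share of the majority class within each feature value), and all function names and the toy data are invented for this example. The sketch runs on a single machine and does not reproduce the MapReduce implementation.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def partition(values, labels):
    """Group the class labels by the value the feature takes."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    return groups


def information_gain(values, labels):
    """Information gain of one feature, as used in traditional ID3."""
    total = len(labels)
    groups = partition(values, labels).values()
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder


def purity_weight(values, labels):
    """ASSUMPTION: placeholder weight, NOT the paper's weighted correlation function.

    Returns the fraction of samples that belong to the majority class
    of their feature-value group (a value in (0, 1]).
    """
    total = len(labels)
    groups = partition(values, labels).values()
    return sum(max(Counter(g).values()) / total for g in groups)


def weighted_information_gain(values, labels, weight_fn=purity_weight):
    """Step 1: compute the weight. Step 2: multiply it by the information gain."""
    return weight_fn(values, labels) * information_gain(values, labels)


# Toy data, invented for illustration only.
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
feature_a = ["yes", "yes", "no", "no", "no", "no"]  # separates "positive" cleanly
feature_b = ["x", "y", "x", "y", "x", "y"]          # carries no class information

for name, feature in [("feature_a", feature_a), ("feature_b", feature_b)]:
    print(name, round(weighted_information_gain(feature, labels), 4))
```

In a tree builder, the feature with the largest weighted score would be chosen for the split at each node, exactly where traditional ID3 would pick the feature with the largest plain information gain. Passing a different `weight_fn` swaps in another weighting without touching the rest of the sketch.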
The primary goal of this work is to enhance the performance of the well-known ID3 classifier in terms of accuracy, execution time, and ability to handle massive datasets. We carried out several experiments that aim to assess the effectiveness of our suggested classifier compared with other contributions chosen from the literature. The experimental results show that our ID3 classifier works better on the COVID-19_Sentiments dataset than other classifiers in terms of recall (85.72%), specificity (86.51%), error rate (11.18%), false-positive rate (13.49%), execution time (15.95 s), kappa statistic (87.69%), F1-score (85.54%), classification rate (88.82%), false-negative rate (14.28%), precision (86.67%), convergence (it converges around iteration 90), stability (it is more stable, with a mean standard deviation of 0.12%), and complexity (it requires much lower time and space computational complexity).
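Several of the reported metrics are complements of one another: the classification rate and error rate, specificity and false-positive rate, and recall and false-negative rate each sum to 100% in the figures above. The sketch below shows where those relations come from, using the standard confusion-matrix definitions for one class against the rest. The counts are hypothetical and chosen only for illustration, and the abstract does not say how the per-class values are averaged over the three labels, so this is not a reconstruction of the paper's evaluation.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (one class vs. the rest)."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn)        # true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    precision = tp / (tp + fp)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "classification_rate": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


# Hypothetical counts, not taken from the paper.
m = binary_metrics(tp=80, fp=10, tn=90, fn=20)

for name, value in m.items():
    print(f"{name}: {value:.2%}")
```

Each complementary pair is computed over the same denominator, which is why the two values always add up to one.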

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
