Enhancing Big Social Media Data Quality for Use in Short-Text Topic Modeling

Belal Abdullah Hezam Murshed; Jemal Abawajy; Suresha Mallappa; Mufeed Ahmed Naji Saif; Sumaia Mohammed Al-Ghuribi; Fahd A. Ghanem

doi:10.1109/ACCESS.2022.3211396

IEEE Access (Jan 2022)

Affiliations

Belal Abdullah Hezam Murshed: Department of Studies in Computer Science, University of Mysore, Mysore, Karnataka, India
Jemal Abawajy: Faculty of Science, Engineering and Built Environment, School of Information Technology, Deakin University, Geelong, VIC, Australia
Suresha Mallappa: Department of Studies in Computer Science, University of Mysore, Mysore, Karnataka, India
Mufeed Ahmed Naji Saif: Department of Computer Applications, Sri Jayachamarajendra College of Engineering (Affiliated to VTU), Mysore, Karnataka, India
Sumaia Mohammed Al-Ghuribi: Department of Computer Science, Faculty of Applied Sciences, Taiz University, Taiz, Yemen
Fahd A. Ghanem: Department of Computer Science & Engineering, PES College of Engineering (Affiliated to University of Mysore), Mandya, India

Journal volume & issue: Vol. 10, pp. 105328–105351

Abstract

With the emergence of microblogging platforms and social media applications, large amounts of user-generated data in the form of comments, reviews, and brief text messages are produced every day. Microblog data is typically of poor quality; hence, improving its quality is a significant scientific and practical challenge. Despite the relevance of the problem, little work has been done so far, especially on microblog data quality for Short-Text Topic Modelling (STTM). This paper addresses the problem and proposes an approach called the Social Media Data Cleansing Model (SMDCM) to improve data quality for STTM. We evaluate SMDCM using six topic modelling methods: Latent Dirichlet Allocation (LDA), Word-Network Topic Model (WNTM), Pseudo-document-based Topic Modelling (PTM), Biterm Topic Model (BTM), Global and Local word embedding-based Topic Modeling (GLTM), and Fuzzy Topic Modelling (FTM). The evaluation uses the Real-world Cyberbullying Twitter (RW-CB-Twitter) and Cyberbullying Mendeley (CB-MNDLY) datasets. The results show that, when the SMDCM techniques are applied, GLTM and WNTM outperform the other STTM models, achieving the best topic coherence and high accuracy on both the RW-CB-Twitter and CB-MNDLY datasets.
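To make the idea of cleansing microblog text before topic modelling concrete, the following is a minimal Python sketch of the kind of steps such a pipeline commonly includes. It is not the SMDCM procedure itself, which the abstract does not enumerate. The function name `clean_tweet`, the tiny stopword list, and the choice of steps (lowercasing, removing URLs and mentions, collapsing elongated words, dropping non-letters and stopwords) are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "you"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # web links
MENTION_RE = re.compile(r"@\w+")                # user mentions
REPEAT_RE = re.compile(r"(.)\1{2,}")            # a character repeated 3+ times
NON_ALPHA_RE = re.compile(r"[^a-z\s]")          # anything that is not a-z or whitespace


def clean_tweet(text):
    """Return a list of cleaned tokens for one short text (illustrative only)."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # drop URLs
    text = MENTION_RE.sub(" ", text)      # drop @mentions
    text = text.replace("#", " ")         # keep the hashtag word, drop the symbol
    text = REPEAT_RE.sub(r"\1\1", text)   # "sooooo" -> "soo"
    text = NON_ALPHA_RE.sub(" ", text)    # drop digits, punctuation, emoji
    # Remove stopwords and single-character leftovers.
    return [tok for tok in text.split() if tok not in STOPWORDS and len(tok) > 1]


if __name__ == "__main__":
    print(clean_tweet("@user You are sooooo mean!!! http://t.co/abc #bullying"))
    # prints ['soo', 'mean', 'bullying']
```

The resulting token lists could then be passed to any of the topic models named above; which steps actually help is an empirical question of the kind the paper evaluates.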

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
