COVIDHealth: A novel labeled dataset and machine learning-based web application for classifying COVID-19 discourses on Twitter

Mahathir Mohammad Bishal; Md. Rakibul Hassan Chowdory; Anik Das; Muhammad Ashad Kabir

Heliyon (Jul 2024)

Affiliations

Mahathir Mohammad Bishal: Department of Computer Science and Engineering, Chittagong University of Engineering and Technology, Chattogram, 4349, Bangladesh
Md. Rakibul Hassan Chowdory: Department of Computer Science and Engineering, Chittagong University of Engineering and Technology, Chattogram, 4349, Bangladesh
Anik Das: Department of Computer Science, St. Francis Xavier University, Antigonish, B2G 2W5, NS, Canada
Muhammad Ashad Kabir: School of Computing, Mathematics, and Engineering, Charles Sturt University, Bathurst, 2795, NSW, Australia; Corresponding author.

Journal volume & issue: Vol. 10, no. 14
p. e34103

Abstract

The COVID-19 pandemic has sparked widespread health-related discussions on social media platforms like Twitter (now named ‘X’). However, the lack of labeled Twitter data poses significant challenges for theme-based classification and tweet aggregation. To address this gap, we developed a machine learning-based web application that automatically classifies COVID-19 discourses into five categories: health risks, prevention, symptoms, transmission, and treatment. We collected 6,667 COVID-19-related tweets using the Twitter API, labeled them, and applied several feature extraction methods. We then compared the performance of seven classical machine learning algorithms (Decision Tree, Random Forest, Stochastic Gradient Descent, AdaBoost, K-Nearest Neighbor, Logistic Regression, and Linear SVC) and four deep learning techniques (LSTM, CNN, RNN, and BERT) for classification. Our results show that the CNN achieved the highest precision (90.41%), recall (90.4%), F1 score (90.4%), and accuracy (90.4%). Among the classical machine learning approaches, Linear SVC exhibited the highest precision (85.71%), recall (86.94%), and F1 score (86.13%). Our study advances the field of health-related data analysis and classification, and offers a publicly accessible web-based tool for public health researchers and practitioners. This tool has the potential to help address public health challenges and enhance awareness during pandemics. The dataset and application are accessible at https://github.com/Bishal16/COVID19-Health-Related-Data-Classification-Website.
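
As an illustration of how precision, recall, F1 score, and accuracy such as those reported above can be computed for a five-category classifier, the following is a minimal sketch in plain Python. It assumes macro averaging (an unweighted mean over the five categories); the abstract does not state which averaging scheme the authors used. The function name `macro_scores` and the toy labels are hypothetical and are not taken from the COVIDHealth dataset.

```python
from collections import Counter

# The five categories named in the abstract.
LABELS = ["health risks", "prevention", "symptoms", "transmission", "treatment"]


def macro_scores(y_true, y_pred, labels=LABELS):
    """Return macro-averaged precision, recall, F1, and overall accuracy.

    Assumption: macro averaging; the paper's averaging scheme is not
    stated in the abstract.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this category
        else:
            fp[pred] += 1    # predicted category was wrong
            fn[truth] += 1   # true category was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
        "accuracy": sum(tp.values()) / len(y_true) if y_true else 0.0,
    }


# Hypothetical toy labels, for illustration only.
y_true = ["symptoms", "symptoms", "prevention", "treatment",
          "transmission", "health risks"]
y_pred = ["symptoms", "prevention", "prevention", "treatment",
          "transmission", "health risks"]

scores = macro_scores(y_true, y_pred)
print(scores)
```

With a different averaging scheme (for example, weighting each category by its number of tweets), the precision, recall, and F1 values would generally differ, while accuracy would stay the same.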

Published in Heliyon

ISSN: 2405-8440 (Online)
Publisher: Elsevier
Country of publisher: United Kingdom
LCC subjects: Science: Science (General); Social Sciences: Social sciences (General)
Website: https://www.cell.com/heliyon/home
