IEEE Access (Jan 2024)

Extracting Semantic Topics About Development in Africa From Social Media

  • Harriet Sibitenda,
  • Awa Diattara,
  • Assitan Traore,
  • Ruofan Hu,
  • Dongyu Zhang,
  • Elke Rundensteiner,
  • Cheikh Ba

DOI
https://doi.org/10.1109/ACCESS.2024.3466834
Journal volume & issue
Vol. 12
pp. 142343 – 142359

Abstract

The extraction of knowledge about prevalent issues discussed on social media in Africa using Artificial Intelligence techniques is vital for informing public governance. The objectives of our study are twofold: (a) to develop machine learning-based models to identify common topics of social concern related to Africa on social media, and (b) to design a classifier capable of inferring the most relevant topic associated with a given social media post. We designed a three-step framework to achieve the first goal of topic identification. The first step applies text-based representation learning methods to generate text embeddings for feature representation. The second step utilizes state-of-the-art Natural Language Processing models, commonly referred to as topic modeling, to group the representations into categories. The third step generates topics from each group, leveraging large language models to create meaningful short-sentence labels from the associated bag-of-tokens. Additionally, we used Llama2 to refine the token words into concise single-word themes that describe each topic in relation to social concerns about development. To address the second goal of classification, we trained classifiers using ensemble voting and stacking methods to determine which of the identified topics best characterizes a given social media post. For our experimental study, we collected a corpus called Social Media for Africa (SMA), consisting of 22,036 records extracted from comments on Twitter (X) and YouTube. The clustering-based model BERTopic produced 304 topics with a topic coherence score of 0.81 (C-v). After merging the topics into broader classes, the BERTopic+ model yielded 11 common topic classes with a coherence score of 0.76 (C-v). For theme extraction, we further refined the leading token words using Llama2, resulting in 98 unique themes labeled by BERTopic_theme, with a coherence score of 0.75 and an IRBO score of 0.50. 
We used the identified topics, based on the groupings, as labels for training a topic classifier. These labels were generated using Llama2 on our SMA corpus. Our comparative study of topic classifiers employing stacking and voting schemes demonstrated that the BERTopic model achieved an accuracy of 0.83 and an F1 score of 0.82 with ensemble voting for training on topics. Furthermore, when training on topic classes, BERTopic+ with ensemble voting achieved the highest accuracy (0.95) and F1 score (0.95) compared to other methods. Additionally, BERTopic_theme achieved superior performance with an ensemble voting classifier, attaining an F1 score of 0.93 and an accuracy of 0.93. The overall performance of classifiers using ensemble stacking was slightly better than that of voting methods for short-sentence topic labeling. For Africa, policymakers should focus on the most pressing social issues: the impact of COVID-19 restrictions on public health and economic recovery, promoting entrepreneurial innovation in energy and environmental sustainability to combat climate change, and responding strategically to China’s rise in global politics to maintain geopolitical stability and foster international cooperation.
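
To make the ensemble voting scheme mentioned above concrete, the following is a minimal sketch of hard (majority) voting over the label predictions of several base classifiers. It is an illustration under stated assumptions, not the authors' implementation: the function name `hard_vote`, the tie-breaking rule, and the toy topic labels are all hypothetical, and the abstract does not state which base classifiers were combined or whether hard or soft voting was used. Stacking differs in that a meta-classifier is trained on the base classifiers' outputs instead of counting votes.

```python
from collections import Counter
from typing import List, Sequence


def hard_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-classifier label predictions by majority vote.

    predictions[i][j] is the label that classifier i assigns to post j.
    Ties are broken in favour of the earliest classifier in the list
    (an arbitrary choice made for this sketch).
    """
    if not predictions:
        return []
    n_posts = len(predictions[0])
    if any(len(p) != n_posts for p in predictions):
        raise ValueError("every classifier must label every post")

    voted = []
    for j in range(n_posts):
        votes = [p[j] for p in predictions]
        counts = Counter(votes)
        best = max(counts.values())
        # Pick the first label, in classifier order, that reaches the top count.
        voted.append(next(v for v in votes if counts[v] == best))
    return voted


# Hypothetical outputs of three base classifiers on four posts.
clf_a = ["health", "energy", "politics", "health"]
clf_b = ["health", "politics", "politics", "economy"]
clf_c = ["economy", "energy", "politics", "energy"]

ensemble_labels = hard_vote([clf_a, clf_b, clf_c])
print(ensemble_labels)  # ['health', 'energy', 'politics', 'health']
```

In the last post all three classifiers disagree, so the sketch falls back to the first classifier's label; a real system might instead average predicted probabilities (soft voting) to resolve such cases.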

Keywords