Journal of Big Data (Apr 2022)

A clustering-based topic model using word networks and word embeddings

  • Wenchuan Mu,
  • Kwan Hui Lim,
  • Junhua Liu,
  • Shanika Karunasekera,
  • Lucia Falzon,
  • Aaron Harwood

DOI: https://doi.org/10.1186/s40537-022-00585-4
Journal volume & issue: Vol. 9, No. 1, pp. 1–38

Abstract

Online social networking services like Twitter are frequently used for discussions on numerous topics of interest, ranging from mainstream and popular topics (e.g., music and movies) to niche and specialized topics (e.g., politics). Due to the popularity of such services, it is a challenging task to automatically model and determine the numerous discussion topics given the large volume of tweets. Adding to this complexity is the need to identify these topics in the absence of prior knowledge about both the types and number of topics, while also requiring technical expertise to tune the numerous parameters of the various models. To address this challenge, we develop the Clustering-based Topic Modelling (ClusTop) algorithm, which first constructs different types of word networks based on different types of n-gram co-occurrence and word embedding distances. Using these word networks, ClusTop is then able to automatically determine the discussion topics using community detection approaches. In contrast to traditional topic models, ClusTop does not require the tuning or setting of numerous parameters and instead uses community detection approaches to automatically determine the appropriate number of topics. The ClusTop algorithm is also able to capture the syntactic meaning in tweets via the use of bigrams, trigrams, other word combinations and word embedding techniques in constructing the word network graph, and utilizes edge weights based on word embedding distances. Using three Twitter datasets with labelled crises and events as topics, we show that ClusTop outperforms various traditional baselines in terms of topic coherence, pointwise mutual information, precision, recall and F-score.
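The sketch below illustrates the general idea described in the abstract: build a word co-occurrence network from a collection of tweets and treat the communities found in that network as topics, so that the number of topics need not be set in advance. The tokenisation, the use of within-tweet unigram co-occurrence, and the choice of the Louvain method are illustrative assumptions, not the paper's exact configuration (ClusTop also considers bigram, trigram and embedding-based networks).

```python
# Minimal sketch of clustering-based topic modelling on a word network.
# Assumes networkx >= 2.8 (for louvain_communities); the sample tweets,
# tokeniser and stopword list are purely illustrative.
from collections import Counter
from itertools import combinations

import networkx as nx

tweets = [
    "flood warning issued for the river valley",
    "heavy rain causes flood in the valley",
    "new music album released by the band",
    "the band announced a tour after the album release",
]

def tokenize(text):
    """Very simple whitespace tokeniser with a small stopword list."""
    stopwords = {"the", "a", "an", "for", "by", "in", "of", "after"}
    return [w for w in text.lower().split() if w not in stopwords]

# Count how often two words appear in the same tweet (unigram co-occurrence).
cooccurrence = Counter()
for tweet in tweets:
    for w1, w2 in combinations(sorted(set(tokenize(tweet))), 2):
        cooccurrence[(w1, w2)] += 1

# Build a weighted word network from the co-occurrence counts.
graph = nx.Graph()
for (w1, w2), count in cooccurrence.items():
    graph.add_edge(w1, w2, weight=count)

# Community detection (here: Louvain) partitions the word network; each
# community is interpreted as one topic, so the number of topics emerges
# from the graph structure rather than being specified by the user.
topics = nx.community.louvain_communities(graph, weight="weight", seed=42)
for i, topic_words in enumerate(topics):
    print(f"Topic {i}: {sorted(topic_words)}")
```

In this toy example the two tweet groups (a flood-related crisis and a music event) tend to fall into separate communities; edge weights could instead be derived from word embedding similarities, as the abstract notes, without changing the overall pipeline.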

Keywords