IEEE Access (Jan 2025)

Automatic Seed Word Selection for Topic Modeling

  • Dahyun Jeong,
  • Jeongin Hwang,
  • Yunjin Choi,
  • Yoon-Yeong Kim

DOI
https://doi.org/10.1109/ACCESS.2025.3540410
Journal volume & issue
Vol. 13
pp. 31269 – 31285

Abstract

Topic modeling is widely used to uncover latent semantic topics from a corpus. However, topic models often struggle to identify minor topics due to their tendency to prioritize dominant patterns in the data. They are also hindered by polysemous words and general terms, which frequently appear in multiple contexts, making topic assignment difficult. Seed-guided topic modeling addresses these issues by incorporating prior knowledge through “seed words”. Existing approaches, however, primarily rely on supervised selection using label-dependent metrics or manual selection. Both are limited by scalability and susceptible to human bias, particularly when dealing with unstructured real-world data. As a result, the selection of seed words in unsupervised settings remains underexplored. To address these challenges, we propose an automated seed word selection process that identifies diverse and cohesive word sets based on inter-word relationships. We instantiate this process with SeedCapture, an algorithm that utilizes co-occurrence to capture meaningful word associations. Unlike prior methods, SeedCapture operates in a fully unsupervised manner, requiring no predefined labels or human intervention. SeedCapture requires minimal parameter tuning and is highly adaptable, enabling seamless integration into existing seed-guided topic models. Through extensive quantitative and qualitative evaluations across multiple datasets and topic models, we demonstrate that SeedCapture achieves results comparable to those obtained through supervised seed word selection.
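
The abstract does not specify how SeedCapture scores or selects words, so the following is only a minimal sketch of the general idea of picking diverse and cohesive seed sets from co-occurrence, not the authors' algorithm. It assumes document-level co-occurrence, Jaccard similarity as the association measure, and a greedy selection rule; the function names `cooccurrence` and `select_seed_sets`, the parameters `n_topics` and `set_size`, and the toy corpus are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Return per-word document frequency and per-pair co-document counts."""
    df, pair = Counter(), Counter()
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        df.update(words)
        # Sorted input means every pair key is stored as (a, b) with a < b.
        pair.update(combinations(words, 2))
    return df, pair


def select_seed_sets(docs, n_topics=2, set_size=3):
    """Greedy illustration (an assumption, not SeedCapture itself)."""
    df, pair = cooccurrence(docs)

    def jaccard(a, b):
        both = pair[(a, b) if a < b else (b, a)]
        return both / (df[a] + df[b] - both)

    # Most frequent words first; alphabetical order breaks ties.
    vocab = sorted(df, key=lambda w: (-df[w], w))
    seed_sets, used = [], set()
    for _ in range(n_topics):
        candidates = [w for w in vocab if w not in used]
        if not candidates:
            break
        # Diversity: the anchor is the word least associated with seeds so far.
        anchor = min(
            candidates,
            key=lambda w: max((jaccard(w, u) for u in used), default=0.0),
        )
        # Cohesion: fill the set with the words most associated with the anchor.
        rest = sorted(
            (w for w in candidates if w != anchor),
            key=lambda w: (-jaccard(anchor, w), w),
        )
        seeds = [anchor] + rest[: set_size - 1]
        seed_sets.append(seeds)
        used.update(seeds)
    return seed_sets


docs = [
    "goal match team league",
    "team match coach goal",
    "league team goal score",
    "stock market price bank",
    "bank market stock trade",
    "price stock bank market",
]
seed_sets = select_seed_sets(docs)
print(seed_sets)
```

On this toy corpus the sketch separates a finance-like set from a sports-like set without any labels. Sets produced this way could then be passed to a seed-guided topic model in place of manually chosen seed words.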

Keywords