IEEE Access (Jan 2020)

Leveraging Global and Local Topic Popularities for LDA-Based Document Clustering

  • Peng Yang,
  • Yu Yao,
  • Huajian Zhou

DOI
https://doi.org/10.1109/ACCESS.2020.2969525
Journal volume & issue
Vol. 8
pp. 24734 – 24745

Abstract

Document clustering is of high importance for many natural language technologies. A wide range of traditional topic models, such as LDA (Latent Dirichlet Allocation) and its variants, have made great progress. However, traditional LDA-based clustering algorithms might not give good results because such probabilistic models require prior distributions, which are always difficult to define. In this paper, we propose a probabilistic model named tpLDA, which incorporates different levels of topic popularity information to determine the prior LDA distribution, discover the latent topics, and achieve better clustering. Specifically, global topic popularity is introduced to reduce the potential distraction in local cluster popularity, and the local cluster popularity draws more attention to certain parts of the global topic popularity. The two popularities contribute complementary information, and their integration can dynamically adjust the statistical parameters of the model. Experimental evaluations on real data sets show that, compared with state-of-the-art approaches, our proposed framework dramatically improves the accuracy of document clustering.
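
The abstract does not give the formulas tpLDA uses, so the following is only an illustrative sketch of the general idea of combining a corpus-wide (global) and a per-cluster (local) topic popularity into a Dirichlet prior. The function names, the linear blend, the `weight` and `strength` parameters, and the counts are all assumptions made for illustration; they are not taken from the paper.

```python
def popularity(counts):
    """Normalize nonnegative topic counts into a probability vector."""
    total = sum(counts)
    if total == 0:
        # No evidence: fall back to a uniform popularity.
        return [1.0 / len(counts)] * len(counts)
    return [c / total for c in counts]


def blended_prior(global_counts, local_counts, weight=0.5, strength=1.0):
    """Mix global and local popularities into Dirichlet concentrations.

    `weight` (share given to the global view) and `strength` (mean
    concentration per topic) are hypothetical knobs, not from the paper.
    """
    if len(global_counts) != len(local_counts):
        raise ValueError("count vectors must cover the same topics")
    g = popularity(global_counts)
    loc = popularity(local_counts)
    k = len(g)
    # Scaling by k keeps the average concentration equal to `strength`.
    return [
        strength * k * (weight * gi + (1.0 - weight) * li)
        for gi, li in zip(g, loc)
    ]


# Made-up topic usage counts for three topics.
global_counts = [60, 30, 10]  # over the whole corpus
local_counts = [2, 6, 2]      # inside one cluster
alpha = blended_prior(global_counts, local_counts, weight=0.5, strength=1.0)
```

In this toy setup, a topic that is rare globally but frequent in the cluster receives a larger concentration than the global view alone would give it, while the global term damps noise from a small cluster. How tpLDA actually weights and updates the two popularities is described in the full paper.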

Keywords