An Adaptive LDA Optimal Topic Number Selection Method in News Topic Identification

Mingming Zheng; Kaizhong Jiang; Ranhui Xu; Lulu Qi

doi:10.1109/ACCESS.2023.3308520

IEEE Access (Jan 2023)

Affiliations

Mingming Zheng: School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai, China
Kaizhong Jiang: School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai, China
Ranhui Xu: School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai, China
Lulu Qi: School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai, China

DOI: https://doi.org/10.1109/ACCESS.2023.3308520
Journal volume & issue: Vol. 11
pp. 92273 – 92284

Abstract

News text is growing explosively, and readers want increasingly heterogeneous news content. News topic identification is therefore needed to help readers quickly and accurately filter news related to their interests, saving time and effort. The Latent Dirichlet Allocation (LDA) model is the most commonly used method for text topic identification. In previous studies, the number of topics had to be specified in advance when using LDA to extract topics. However, choosing too many or too few topics strongly affects the final results of an LDA topic model and directly determines the quality of topic extraction. Moreover, news text datasets from social media are highly time-sensitive, yet past studies of news topic identification have not combined temporal and semantic modeling. To address these problems, this paper proposes an adaptive method for determining the optimal number of topics that fuses the semantic and temporal information in news datasets. The method first extracts semantic and temporal features as two different views. It then performs density peak clustering with multi-view information fusion on the two resulting feature vectors. The clustering result is used as the final optimal number of topics. To demonstrate the effectiveness of the proposed method, the paper compares it on public datasets with four traditional methods for determining the optimal number of topics. The results show that the number of topics selected using semantic and temporal factors performs significantly better than the four traditional methods in terms of F-value, PMI score, and MI score, and that it also performs well on other indicators.
These experimental results show that the proposed method, which combines the temporal and semantic information of news data to determine the optimal number of topics, can to some extent improve both the accuracy of topic number selection in the LDA model and the effectiveness of news topic identification. It can help readers better understand and use the massive volume of news text. The method also points to a broader approach: identifying and mining datasets with distinctive characteristics from multiple perspectives.
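
The pipeline outlined in the abstract (two views, fused density peak clustering, cluster count used as the topic number) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the feature extraction, the fusion rule, or the rule for picking density peaks. The sketch assumes a weighted sum of normalized per-view distances, a Gaussian density kernel with a quantile-based cutoff, and a "mean plus two standard deviations" threshold on the peak score. The names `pairwise_dist`, `estimate_topic_count`, `weight`, `dc_quantile`, and `n_sigma` are hypothetical.

```python
import numpy as np


def pairwise_dist(x):
    """Euclidean distance matrix between rows of x, scaled into [0, 1]."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    m = d.max()
    return d / m if m > 0 else d


def estimate_topic_count(semantic, timestamps, weight=0.5,
                         dc_quantile=0.1, n_sigma=2.0):
    """Count density peaks on a fused semantic + temporal distance matrix.

    Assumptions (not taken from the paper): weighted-sum fusion, Gaussian
    kernel density, and a mean + n_sigma * std threshold on rho * delta.
    """
    # Fuse the two views into one distance matrix.
    d = weight * pairwise_dist(semantic) + (1.0 - weight) * pairwise_dist(timestamps)
    n = d.shape[0]

    # Cutoff distance: a low quantile of all pairwise distances.
    dc = np.quantile(d[np.triu_indices(n, k=1)], dc_quantile)
    if dc <= 0:
        raise ValueError("cutoff distance is zero; too many duplicate points")

    # Local density rho (the self-contribution of 1.0 is removed).
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0

    # delta: distance to the nearest point that ranks higher in density.
    order = np.argsort(-rho)
    delta = np.empty(n)
    delta[order[0]] = d[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        delta[i] = d[i, order[:rank]].min()

    # Density peaks have both high rho and high delta.
    gamma = rho * delta
    threshold = gamma.mean() + n_sigma * gamma.std()
    return int((gamma > threshold).sum())


# Toy data: three tight groups of documents, separated in both views.
grid = np.linspace(-0.4, 0.4, 5)
offsets = np.array([(a, b) for a in grid for b in grid])
semantic = np.vstack([np.array(c) + offsets
                      for c in [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]])
timestamps = np.concatenate([t0 + 0.1 * np.arange(25)
                             for t0 in (0.0, 100.0, 200.0)])

k = estimate_topic_count(semantic, timestamps)
print("estimated number of topics:", k)
```

The returned count could then be passed as the topic number when fitting an LDA model. In practice the semantic view would come from document embeddings or topic-word features rather than the toy 2-D points used here, and `weight` controls how much the temporal view contributes.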

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
