Jisuanji kexue yu tansuo (Dec 2023)
Survey of Automatic Labeling Methods for Topic Models
Abstract
Topic models are widely used to model unstructured corpora and discrete data in order to extract latent topics. Because topics are generally expressed as word lists, users often find it difficult to understand what a topic means, especially when they lack knowledge of the subject area. Manually labeling topics yields more explanatory and easily understood labels, but the cost is too high for this to be feasible. Research on automatically labeling discovered topics addresses this problem. This survey first elaborates and analyzes latent Dirichlet allocation (LDA), currently the most popular topic modeling technique. It then classifies topic labeling methods into three types according to how the topic label is represented: as phrases, as summaries, or as images. Centered on improving the interpretability of topics, the relevant research of recent years is organized, analyzed, and summarized for each type of generated label, and the applicable scenarios and usability of the different label types are discussed. Methods are further categorized according to their characteristics, with a focus on the quantitative and qualitative analysis of summary-style topic labels generated by lexical-based, submodular optimization, and graph-based methods. The methods are then compared with respect to learning type, technologies used, and data sources. Finally, the existing problems and development trends of research on automatic topic labeling are discussed. Future directions include labeling methods based on deep learning, integration with sentiment analysis, and continued expansion of the scenarios in which topic labeling is applicable.
Keywords