IEEE Access (Jan 2021)
Polysemy Needs Attention: Short-Text Topic Discovery With Global and Multi-Sense Information
Abstract
Topic models are widely applied in research domains such as information retrieval and data mining, where they discover the topics of texts in an unsupervised way. Early research focused mainly on long texts. With the growth of the Internet, the number of short texts is increasing rapidly. Most existing approaches to the sparsity problem of short texts rely on data aggregation or model improvements. Among them, the Biterm Topic Model is one of the most representative: it models topics over document-level word pairs (biterms) and has proven both novel and effective. However, this strategy ignores word pairs that are semantically similar but rarely co-occur. Moreover, most studies ignore the multi-sense phenomenon (polysemy) in natural language. In this paper, we use multi-sense word vectors to extract similar word pairs from the whole corpus while taking multiple senses into account. Based on this idea, we introduce a novel short-text topic model that disambiguates the multiple senses of words and generates more reasonable global biterms. Experimental results on two open-source English datasets show that the proposed model outperforms state-of-the-art topic models.
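To make the distinction between document-level and global biterms concrete, the following is a minimal sketch and not the paper's actual procedure. It assumes that each word comes with a list of sense vectors, that similarity between two words is the maximum cosine similarity over their sense pairs, and that a fixed threshold selects global biterms. The function names, the toy vectors, and the threshold value are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def document_biterms(tokens):
    """Unordered pairs of distinct words co-occurring in one short text."""
    return {tuple(sorted(pair)) for pair in combinations(set(tokens), 2)}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def global_biterms(sense_vectors, threshold=0.9):
    """Word pairs whose best-matching senses are similar, corpus-wide.

    Assumption: sense_vectors maps each word to a list of sense vectors,
    and a pair qualifies when any sense of one word is close to any sense
    of the other. The paper may score and select pairs differently.
    """
    pairs = set()
    for w1, w2 in combinations(sorted(sense_vectors), 2):
        best = max(
            cosine(u, v)
            for u in sense_vectors[w1]
            for v in sense_vectors[w2]
        )
        if best >= threshold:
            pairs.add((w1, w2))
    return pairs


# Toy two-dimensional sense vectors (hypothetical values).
toy_senses = {
    "bank": [[1.0, 0.0], [0.0, 1.0]],  # a finance sense and a river sense
    "loan": [[0.9, 0.1]],
    "shore": [[0.1, 0.9]],
}

local_pairs = document_biterms(["bank", "loan", "bank"])
global_pairs = global_biterms(toy_senses)
```

In this toy setup, "bank" pairs with "loan" through one sense and with "shore" through the other, while "loan" and "shore" stay unpaired. A single averaged vector for "bank" would blur exactly this distinction, which is the motivation for working with multiple senses.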
Keywords