Polysemy Needs Attention: Short-Text Topic Discovery With Global and Multi-Sense Information

Heng-Yang Lu; Jun Yang; Yi Zhang; Zuoyong Li

doi:10.1109/ACCESS.2021.3052863

IEEE Access (Jan 2021)

Affiliations

Heng-Yang Lu: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China
Jun Yang: Marcpoint Company Ltd., Shanghai, China
Yi Zhang: Department of Computer Science and Technology, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Zuoyong Li: Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, College of Computer and Control Engineering, Minjiang University, Fuzhou, China

DOI: https://doi.org/10.1109/ACCESS.2021.3052863
Journal volume & issue: Vol. 9
pp. 14918 – 14932

Abstract

Topic models have been widely applied in research domains such as information retrieval and data mining. They can discover the topics of texts in an unsupervised way. Early research focused mainly on long texts. With the emergence of the Internet, the number of short texts is growing rapidly. Most existing schemes for addressing the sparsity problem of short texts are based on data aggregation or model improvements. Among them, the Biterm Topic Model is one of the most representative. It introduced a new way to model topics based on document-level word pairs and has proven both creative and effective. However, this strategy ignores word pairs that are semantically similar but rarely co-occur. Moreover, most studies ignore the multi-sense phenomenon in natural language. In this paper, we use multi-sense word vectors to extract similar word pairs from the whole corpus by considering multiple senses. Based on this idea, we introduce a novel short-text topic model that disambiguates the multiple senses of words and generates more reasonable global biterms. Experimental results on two open-source English datasets show that it outperforms state-of-the-art topic models.
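
As an illustration only, and not the authors' implementation, the two kinds of word pairs mentioned in the abstract can be sketched in Python. The function names, the toy two-dimensional sense vectors, and the similarity threshold are all assumptions made for this example; the abstract does not specify how senses are represented or how similar pairs are selected.

```python
from itertools import combinations
import math


def document_biterms(tokens):
    """Return unordered word pairs that co-occur within one short text."""
    return [tuple(sorted(pair))
            for pair in combinations(tokens, 2)
            if pair[0] != pair[1]]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def global_biterms(sense_vectors, threshold):
    """Pair words across the corpus when any two of their senses are similar.

    sense_vectors maps each word to a list of vectors, one per sense
    (a hypothetical representation chosen for this sketch).
    """
    pairs = []
    for w1, w2 in combinations(sorted(sense_vectors), 2):
        best = max(cosine(u, v)
                   for u in sense_vectors[w1]
                   for v in sense_vectors[w2])
        if best >= threshold:
            pairs.append((w1, w2))
    return pairs


# Toy data (assumed): "apple" has a company sense and a fruit sense.
sense_vectors = {
    "apple": [[1.0, 0.0], [0.0, 1.0]],
    "iphone": [[0.9, 0.1]],
    "banana": [[0.1, 0.9]],
}

local_pairs = document_biterms(["apple", "iphone", "release"])
similar_pairs = global_biterms(sense_vectors, threshold=0.95)
```

In this toy setting, "apple" is paired with "iphone" through one sense and with "banana" through the other, even though "apple" and "banana" never appear in the same text. A single averaged vector for "apple" would sit between the two senses and could miss both pairs at the same threshold.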

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
