Text Similarity Measurement of Semantic Cognition Based on Word Vector Distance Decentralization With Clustering Analysis

Shenghan Zhou; Xingxing Xu; Yinglai Liu; Runfeng Chang; Yiyong Xiao

doi:10.1109/ACCESS.2019.2932334

IEEE Access (Jan 2019)

Affiliations

Shenghan Zhou: School of Reliability and Systems Engineering, Beihang University, Beijing, China
Xingxing Xu: School of Reliability and Systems Engineering, Beihang University, Beijing, China
Yinglai Liu: School of Reliability and Systems Engineering, Beihang University, Beijing, China
Runfeng Chang: School of Information Science and Technology, North China University of Technology, Beijing, China
Yiyong Xiao: School of Reliability and Systems Engineering, Beihang University, Beijing, China

DOI: https://doi.org/10.1109/ACCESS.2019.2932334
Journal volume & issue: Vol. 7, pp. 107247–107258

Abstract

Text similarity measurement is a basic task in natural language processing and is widely used in text information mining, news classification and clustering, artificial intelligence, and other fields. This paper proposes a text similarity measure named word vector distance decentralization (WVDD), which can deal with complex semantic relations in the Chinese language, including sentence components, word order, and weights. Clustering analysis is then performed on the resulting similarity values, using a K-means algorithm built on the Spark architecture so that parallel computing accelerates the clustering. For experimental verification, the test set consists of a large number of customer comments posted on the Jingdong website, a comprehensive online shopping mall. The F-measure is used to evaluate the accuracy of the results. The proposed method is compared with the sentence vector model (Doc2vec) and the bag-of-words model, and its superiority over both is verified. The method can be applied to the analysis of network language, such as online customer comments and web chat data.
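
As a rough illustration of two ingredients the abstract mentions, the sketch below computes a bag-of-words cosine similarity (one of the baselines) and an F-measure from precision and recall. It is not the WVDD method, which the abstract does not specify in detail. The function names `bow_cosine` and `f_measure`, the pre-tokenized inputs, and the `beta` parameter are assumptions made for this example, and the paper's exact F-measure definition for clustering may differ.

```python
import math
from collections import Counter


def bow_cosine(tokens_a, tokens_b):
    """Cosine similarity of two token lists under a bag-of-words model.

    Assumes the text is already segmented into tokens (hypothetical input
    format; Chinese text would need a word segmenter first).
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Because the bag-of-words model only counts tokens, `bow_cosine(["good", "phone"], ["phone", "good"])` is approximately 1 even though the word order differs. That insensitivity to word order is one of the limitations the abstract says WVDD is meant to address.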

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
