IEEE Access (Jan 2019)

Text Similarity Measurement of Semantic Cognition Based on Word Vector Distance Decentralization With Clustering Analysis

  • Shenghan Zhou,
  • Xingxing Xu,
  • Yinglai Liu,
  • Runfeng Chang,
  • Yiyong Xiao

DOI
https://doi.org/10.1109/ACCESS.2019.2932334
Journal volume & issue
Vol. 7
pp. 107247 – 107258

Abstract

Text similarity measurement, a basic task in natural language processing, is widely used in text information mining, news classification and clustering, artificial intelligence, and other fields. This paper proposes a text similarity measurement method named word vector distance decentralization (WVDD), which can handle complex semantic relations in the Chinese language, including sentence components, word order, and weights. Clustering analysis is then performed on the resulting similarity values. A K-means algorithm running on the Spark architecture for parallel computing is adopted to accelerate clustering. In the experimental verification, the test set consists of a significant number of customer comments posted on the Jingdong website, a comprehensive online shopping mall. F-measure is used to evaluate the accuracy of the results obtained by the proposed method. The superiority of the proposed method is verified by comparison with the sentence vector model (Doc2vec) and the bag-of-words model. The proposed method can be applied to analyze network language, such as online customer comments and web chat data.
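
The abstract names F-measure as the evaluation metric but does not state which variant the authors use. As a minimal sketch, the following Python function computes the standard precision, recall, and F-beta score over binary decisions (for example, "this pair of comments is similar"). The function name `f_measure`, the boolean-list input format, and the default `beta=1.0` are assumptions made for illustration and are not taken from the paper.

```python
def f_measure(predicted, actual, beta=1.0):
    """Return (precision, recall, F-beta) for two equal-length lists of booleans.

    Hypothetical helper for illustration; the paper's exact evaluation
    protocol is not given in the abstract.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0

    # F-beta is the weighted harmonic mean of precision and recall.
    b2 = beta * beta
    score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, score


if __name__ == "__main__":
    predicted = [True, True, False, True]
    actual = [True, False, False, True]
    print(f_measure(predicted, actual))
```

With `beta=1.0` the score reduces to the familiar F1, which weights precision and recall equally. A larger `beta` favors recall and a smaller one favors precision.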

Keywords