Naučno-tehničeskij Vestnik Informacionnyh Tehnologij, Mehaniki i Optiki (Dec 2019)

TEXT CLUSTERING POWERED BY SEMANTICO-SYNTACTIC FEATURES

  • Sergei V. Lapshin,
  • Ilya S. Lebedev,
  • Anton I. Spivak

DOI
https://doi.org/10.17586/2226-1494-2019-19-6-1058-1063
Journal volume & issue
Vol. 19, no. 6
pp. 1058 – 1063

Abstract

Subject of Research. The study is devoted to improving the quality indicators of text clustering. The main attention is paid to extracting the features that form the mathematical model of the texts. The k-means method is used to cluster the resulting vector representations of the texts. Method. An analytical approach is proposed based on semantico-syntactic features of the clustered texts. Feature extraction was performed using the Stanford CoreNLP Toolkit. Selected links between the words of the texts, taken from the “Enhanced++ Dependencies” representation, were encoded together with the words they connect. The values of the semantico-syntactic features were calculated from the frequencies of the encoded links in the texts. Main Results. In an experiment, the quality indicators of a prototype built on the proposed method were compared with those of a clustering system based on statistical features; the proposed method reduced the number of clustering errors by more than 15%. Practical Relevance. Pre-training is not required to obtain the semantico-syntactic features of the texts. Therefore, the proposed approach can be used to improve clustering quality indicators in the absence of the large text corpora that are necessary for pre-training statistical language models based on word embeddings.
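The pipeline described in the abstract (encode dependency links together with their words, turn link frequencies into feature vectors, cluster with k-means) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The dependency triples below are invented and stand in for parser output such as Stanford CoreNLP might produce; the encoding format `relation(governor,dependent)`, the use of relative frequencies, the fixed initial centroids, and all function names are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical pre-parsed input: each text is a list of
# (governor, relation, dependent) triples. The triples are invented
# for illustration and were not produced by running a parser.
parsed_texts = [
    [("rose", "nsubj", "price"), ("rose", "obl:in", "market")],
    [("rose", "nsubj", "price"), ("fell", "obl:in", "market")],
    [("scored", "nsubj", "team"), ("scored", "obj", "goal")],
    [("scored", "nsubj", "team"), ("won", "obj", "match")],
]

def encode_link(governor, relation, dependent):
    """Encode one dependency link together with the two words it connects."""
    return f"{relation}({governor},{dependent})"

def link_frequencies(triples):
    """Relative frequency of each encoded link within one text (assumed weighting)."""
    counts = Counter(encode_link(*t) for t in triples)
    total = sum(counts.values())
    return {link: n / total for link, n in counts.items()}

def vectorize(parsed):
    """Map every text to a dense vector over the shared vocabulary of encoded links."""
    freqs = [link_frequencies(t) for t in parsed]
    vocab = sorted({link for f in freqs for link in f})
    return vocab, [[f.get(link, 0.0) for link in vocab] for f in freqs]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, init_indices, iterations=10):
    """Plain k-means; initial centroids are fixed so the run is deterministic."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

vocab, vectors = vectorize(parsed_texts)
labels = kmeans(vectors, init_indices=[0, 2])
print(labels)
```

In this toy data the first two texts share one encoded link and the last two share another, so they end up in separate clusters. With real texts the triples would come from a dependency parser, and the choice of which links to encode would follow the selection criteria of the method, which the abstract does not detail.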

Keywords