Model of Lexico-Semantic Bonds between Texts for Creating Their Similarity Metrics and Developing Statistical Clustering Algorithm

Liliya Demidova; Dmitry Zhukov; Elena Andrianova; Vladimir Kalinin

doi:10.3390/a16040198

Algorithms (Apr 2023)

Affiliations

Liliya Demidova: Institute of Information Technology, MIREA-Russian Technological University, 78 Vernadsky Avenue, 119454 Moscow, Russia
Dmitry Zhukov: Institute of Cybersecurity and Digital Technologies, MIREA-Russian Technological University, 78 Vernadsky Avenue, 119454 Moscow, Russia
Elena Andrianova: Institute of Information Technology, MIREA-Russian Technological University, 78 Vernadsky Avenue, 119454 Moscow, Russia
Vladimir Kalinin: Institute of Radio Electronics and Informatics, MIREA-Russian Technological University, 78 Vernadsky Avenue, 119454 Moscow, Russia

DOI: https://doi.org/10.3390/a16040198
Journal volume & issue: Vol. 16, no. 4, p. 198

Abstract

To cluster texts into semantic groups, we propose a model of a unified lexico-semantic bond between texts and a similarity matrix based on it. Using lexico-semantic analysis methods, we can create “term–document” matrices based both on the occurrence frequencies of words and n-grams and on the degrees of nodes in their semantic network, and then calculate the cosine metrics of text similarity. When the text similarity matrix is constructed using lexical or semantic analysis methods, the cosine of the angle between the vectors describing a pair of texts determines their degree of similarity in the lexical or semantic presentation, respectively. Based on the averaging procedure described in this paper, we can obtain a matrix of cosine metric values that describes the lexico-semantic bonds between texts. We also propose an algorithm for solving text clustering problems. This algorithm uses the statistical characteristics of the distribution functions of element values in the rows of the cosine metric value matrix in the model of the lexico-semantic bond between documents. In addition, the algorithm can be applied to a matrix of cosine metric values obtained from the lexical or the semantic properties of texts alone. Our research has shown that the developed model for the lexico-semantic presentation of texts slightly increases the accuracy of their subsequent clustering. The statistical text clustering algorithm based on this model shows excellent results, comparable to those of the widely used affinity propagation algorithm. Additionally, our algorithm requires neither a specified degree of similarity for combining vectors into a common cluster nor other configuration parameters. The suggested model and algorithm significantly expand the list of known approaches for determining text similarity metrics and clustering texts.
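The similarity-matrix construction outlined in the abstract can be illustrated with a minimal sketch. It assumes each text is already represented by one lexical vector and one semantic vector (the toy numbers below are invented for illustration), and it assumes the averaging is a plain element-wise arithmetic mean of the two cosine matrices; the abstract does not specify the procedure, so the paper's actual averaging may differ. The function names are hypothetical.

```python
import math


def cosine_matrix(vectors):
    """Pairwise cosine similarity between row vectors (one row per text)."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] == 0 or norms[j] == 0:
                continue  # a zero vector has no direction; leave similarity at 0
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


def average_matrices(a, b):
    """Element-wise mean of two equally sized matrices (assumed averaging rule)."""
    return [[(x + y) / 2 for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


# Hypothetical toy data: three texts, three features each.
# Rows of `lexical` stand in for word/n-gram frequencies,
# rows of `semantic` stand in for semantic-network node degrees.
lexical = [[2, 1, 0], [1, 1, 0], [0, 0, 3]]
semantic = [[3, 0, 1], [2, 0, 1], [0, 4, 0]]

lex_sim = cosine_matrix(lexical)
sem_sim = cosine_matrix(semantic)
combined = average_matrices(lex_sim, sem_sim)

for row in combined:
    print([round(value, 3) for value in row])
```

In this toy data the first two texts are close in both representations and the third shares no features with either, so the combined matrix has a high off-diagonal value for the first pair and zeros against the third text.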

Published in Algorithms

ISSN: 1999-4893 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.mdpi.com/journal/algorithms
