Greedy Texts Similarity Mapping

Aliya Jangabylova; Alexander Krassovitskiy; Rustam Mussabayev; Irina Ualiyeva

doi:10.3390/computation10110200

Computation (Nov 2022)

Affiliations

Aliya Jangabylova: Institute of Information and Computational Technologies, Pushkin Str., 125, Almaty 050010, Kazakhstan
Alexander Krassovitskiy: Institute of Information and Computational Technologies, Pushkin Str., 125, Almaty 050010, Kazakhstan
Rustam Mussabayev: Institute of Information and Computational Technologies, Pushkin Str., 125, Almaty 050010, Kazakhstan
Irina Ualiyeva: Faculty of Information Technology, Al-Farabi Kazakh National University, 71 Al-Farabi Ave., Almaty 050040, Kazakhstan

Journal volume & issue: Vol. 10, no. 11, p. 200

Abstract

Document similarity metrics are an important tool in areas such as determining the topics of documents, plagiarism detection, and other problems that require capturing the semantic, syntactic, or structural similarity of texts. The results of a similarity measure depend on the type of word representation and on the problem statement, and computing them can be time-consuming. In this paper, we present a problem-independent similarity metric algorithm, greedy texts similarity mapping (GTSM), which is computationally efficient enough to be applied to large datasets with any preferred word vectorization model. GTSM maps words in two texts based on a decision rule that evaluates word similarity and the words’ importance to the texts. We compare it with the well-known word mover’s distance (WMD) algorithm in the k-nearest neighbors text classification problem and find that it leads to similar or better results. In the task of evaluating how well similarity measures correlate with human-judged scores, GTSM achieves higher correlation scores than WMD and sentence mover’s similarity (SMS), which shows that it is a decent alternative for both word-level and sentence-level tasks.
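
The abstract describes the mapping only at a high level, so the following is a rough sketch of what a greedy word mapping could look like, not the decision rule from the paper. It assumes that each word comes with an embedding vector and a positive importance weight (for example a TF-IDF value), that candidate pairs are scored as cosine similarity times the product of the two weights, and that each word is matched at most once. The names `cosine` and `greedy_text_similarity` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def greedy_text_similarity(vecs_a, vecs_b, weights_a, weights_b):
    """Greedily pair words of text A with words of text B.

    Assumed scoring rule (not taken from the paper): a pair's score is its
    cosine similarity multiplied by the importance weights of both words.
    Returns (similarity, pairs), where pairs holds (index_in_a, index_in_b).
    """
    # Score every possible pair of words across the two texts.
    candidates = [
        (weights_a[i] * weights_b[j] * cosine(u, v), i, j)
        for i, u in enumerate(vecs_a)
        for j, v in enumerate(vecs_b)
    ]
    # Visit pairs from best to worst; keep a pair only if both words are free.
    candidates.sort(key=lambda c: c[0], reverse=True)
    used_a, used_b, pairs = set(), set(), []
    weighted_sum = weight_total = 0.0
    for score, i, j in candidates:
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        pairs.append((i, j))
        weighted_sum += score
        weight_total += weights_a[i] * weights_b[j]
    # Importance-weighted average similarity over the matched pairs.
    similarity = weighted_sum / weight_total if weight_total else 0.0
    return similarity, pairs


if __name__ == "__main__":
    # Toy 2-d "embeddings": the same two words in a different order.
    text_a = [[1.0, 0.0], [0.0, 1.0]]
    text_b = [[0.0, 1.0], [1.0, 0.0]]
    print(greedy_text_similarity(text_a, text_b, [1.0, 1.0], [1.0, 1.0]))
```

Sorting all candidate pairs once and accepting them greedily costs O(nm log nm) for texts of n and m words, which avoids solving the optimal transport problem that WMD requires. Whether GTSM uses this particular ordering and normalization is an assumption of the sketch.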

Published in Computation

ISSN: 2079-3197 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://www.mdpi.com/journal/computation
