Proceedings of the XXth Conference of Open Innovations Association FRUCT (Apr 2024)

Matching Literature Heritage Entities From Heterogeneous Data Sources Based On The Textual Description

  • Georgii Sipovskii
  • Nikolay N Teslya

DOI
https://doi.org/10.23919/FRUCT61870.2024.10516371
Journal volume & issue
Vol. 35, no. 1
pp. 714 – https://youtu.be/7ZSKfCc5S9E

Abstract

The paper focuses on the problem of short text matching for aligning literature heritage entities from heterogeneous data sources. An overview of existing methods showed that all of them work well for long texts. The paper proposes a modification of the Jaccard similarity metric to solve the problem, based on the similarity of unique text tokens and adjusted to the specifics of the literature heritage domain. The results were evaluated on the literature heritage of A.S. Pushkin gathered from various heterogeneous sources (datasets, full works compilations, Encyclopedia of A.S. Pushkin) and showed high accuracy of the developed method in finding corresponding entities within the system.
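
For orientation, the baseline that the proposed method builds on can be sketched as follows. This is a minimal illustration of the standard Jaccard similarity over unique tokens, not the modified metric from the paper, whose domain-specific adjustments are not described in the abstract. The function names `tokenize` and `jaccard`, the regex-based tokenization, and the example titles are assumptions made for illustration only.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of unique word tokens.

    Assumption: a simple regex tokenizer; the paper's actual
    preprocessing is not specified in the abstract.
    """
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Standard Jaccard similarity: |A ∩ B| / |A ∪ B| over unique tokens."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        # Two empty descriptions carry no evidence of a match.
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical short descriptions of the same work from two sources.
score = jaccard("Eugene Onegin", "Eugene Onegin, a novel in verse")
print(round(score, 3))
```

With short texts such as titles, a few extra tokens in one description noticeably lower the score, which is the kind of sensitivity a domain-adjusted variant would need to address.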

Keywords