Applied Sciences (Feb 2024)

A Hierarchical Orthographic Similarity Measure for Interconnected Texts Represented by Graphs

  • Maxime Deforche,
  • Ilse De Vos,
  • Antoon Bronselaer,
  • Guy De Tré

DOI
https://doi.org/10.3390/app14041529
Journal volume & issue
Vol. 14, no. 4
p. 1529

Abstract

Similarity measures play a pivotal role in automatic techniques designed to analyse large volumes of textual data. Conventional approaches, which treat texts as paradigmatic examples of unstructured data, tend to overlook their structural nuances, leading to a loss of valuable information. In this paper, we propose a novel orthographic similarity measure tailored for the semi-structured analysis of texts. We explore a graph-based representation for texts, where the graph’s structure is shaped by a hierarchical decomposition of textual discourse units. Employing the concept of edit distances, our orthographic similarity measure is computed hierarchically across all components in this textual graph, integrating precomputed similarity values among lower-level nodes. The relevance and applicability of the presented approach are illustrated by a real-world example featuring texts that exhibit intricate interconnections among their components. The resulting similarity scores, computed between all structural levels of the graph, allow for a deeper understanding of the (structural) interconnections among texts and enhance the explainability of similarity measures as well as the tools using them.
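
To make the idea of a hierarchically computed edit-distance similarity more concrete, the following is a minimal illustrative sketch in Python. It is not the measure defined in the paper: the three-level decomposition (characters, words, sentences), the normalisation by the longer sequence, the unit insertion and deletion costs, and all function names are assumptions chosen for illustration only. The sketch reuses one edit-distance routine at every level and takes the substitution cost between two higher-level units from the similarity already computed between them one level below.

```python
from functools import lru_cache
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def weighted_edit_distance(
    a: Sequence[T], b: Sequence[T], sub_cost: Callable[[T, T], float]
) -> float:
    """Levenshtein-style distance with unit insert/delete cost and a
    caller-supplied substitution cost expected to lie in [0, 1]."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [float(i)]
        for j, y in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1.0,               # delete x
                    curr[j - 1] + 1.0,           # insert y
                    prev[j - 1] + sub_cost(x, y),  # substitute x by y
                )
            )
        prev = curr
    return prev[-1]


def similarity(
    a: Sequence[T], b: Sequence[T], sub_cost: Callable[[T, T], float]
) -> float:
    """Assumed normalisation: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1.0 - weighted_edit_distance(a, b, sub_cost) / longest


@lru_cache(maxsize=None)  # caches word-level scores for reuse at higher levels
def word_similarity(w1: str, w2: str) -> float:
    """Lowest level: plain character edit distance."""
    return similarity(w1, w2, lambda x, y: 0.0 if x == y else 1.0)


def sentence_similarity(s1: Sequence[str], s2: Sequence[str]) -> float:
    """Middle level: substituting one word for another costs 1 - word similarity."""
    return similarity(s1, s2, lambda u, v: 1.0 - word_similarity(u, v))


def text_similarity(
    t1: Sequence[Sequence[str]], t2: Sequence[Sequence[str]]
) -> float:
    """Top level: substituting one sentence costs 1 - sentence similarity."""
    return similarity(t1, t2, lambda p, q: 1.0 - sentence_similarity(p, q))


if __name__ == "__main__":
    first = [["the", "cat", "sat"], ["it", "purred"]]
    second = [["the", "cut", "sat"], ["it", "purred"]]
    print(word_similarity("cat", "cut"))
    print(sentence_similarity(first[0], second[0]))
    print(text_similarity(first, second))
```

Under these assumptions, every level yields a score in [0, 1], and the intermediate word-level and sentence-level scores remain available for inspection, which is one way the per-level scores mentioned in the abstract could support explainability.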

Keywords