Informatics (Sep 2023)

Analyzing Indo-European Language Similarities Using Document Vectors

  • Samuel R. Schrader,
  • Eren Gultepe

DOI
https://doi.org/10.3390/informatics10040076
Journal volume & issue
Vol. 10, no. 4
p. 76

Abstract


The evaluation of similarities between natural languages often relies on prior knowledge of the languages being studied. We describe three methods for building phylogenetic trees and clustering languages without the use of language-specific information. The input to our methods is a set of document vectors trained on a corpus of parallel translations of the Bible into 22 Indo-European languages, representing four language families: Indo-Iranian, Slavic, Germanic, and Romance. The corpus contains 532,092 Bible verses in total: the same 24,186 verses translated into each of the 22 languages. The methods are (A) hierarchical clustering using the distance between language vector centroids, (B) hierarchical clustering using a network-derived distance measure, and (C) Deep Embedded Clustering (DEC) of language vectors. We evaluate our methods against a ground-truth tree and the language families derived from that tree. All three methods achieve clustering F-scores above 0.9 on the Indo-Iranian and Slavic families; most confusion is between the Germanic and Romance families. The mean F-scores across all families are 0.864 (centroid clustering), 0.953 (network partitioning), and 0.763 (DEC). These results show that document vectors can be used to capture and compare linguistic features of multilingual texts, and thus could help extend research on language similarity and translation studies.
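
As a rough illustration of method (A), the following Python sketch averages each language's document vectors into a centroid and applies hierarchical clustering to the centroids. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the distance metric or linkage rule, so cosine distance and average linkage are assumed here, the function name cluster_languages_by_centroid is hypothetical, and the input is synthetic toy data rather than the Bible corpus.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_languages_by_centroid(vectors_by_language, n_clusters):
    """Cluster languages by the centroids of their document vectors.

    vectors_by_language maps a language name to an array of shape
    (n_documents, dim). Cosine distance and average linkage are
    assumptions made for this sketch.
    """
    languages = sorted(vectors_by_language)
    # One centroid per language: the mean of its document vectors.
    centroids = np.vstack(
        [np.asarray(vectors_by_language[lang]).mean(axis=0) for lang in languages]
    )
    # Condensed pairwise distance matrix between language centroids.
    distances = pdist(centroids, metric="cosine")
    # Agglomerative clustering; the result encodes the language tree.
    tree = linkage(distances, method="average")
    # Cut the tree into at most n_clusters flat clusters.
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return dict(zip(languages, labels.tolist())), tree


# Synthetic toy data (not from the paper): two artificial "families"
# whose document vectors point in different directions.
rng = np.random.default_rng(0)
dim = 16
centre_a = np.zeros(dim)
centre_a[0] = 5.0
centre_b = np.zeros(dim)
centre_b[1] = 5.0
toy_vectors = {
    "lang_a1": centre_a + rng.normal(scale=0.3, size=(50, dim)),
    "lang_a2": centre_a + rng.normal(scale=0.3, size=(50, dim)),
    "lang_b1": centre_b + rng.normal(scale=0.3, size=(50, dim)),
    "lang_b2": centre_b + rng.normal(scale=0.3, size=(50, dim)),
}

assignment, tree = cluster_languages_by_centroid(toy_vectors, n_clusters=2)
print(assignment)
```

The returned linkage matrix can be passed to scipy.cluster.hierarchy.dendrogram to draw the tree, and the flat cluster labels can be compared with known language families to compute per-family F-scores.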

Keywords