Analyzing Indo-European Language Similarities Using Document Vectors

Samuel R. Schrader; Eren Gultepe

doi:10.3390/informatics10040076

Informatics (Sep 2023)

Affiliations

Samuel R. Schrader: Department of Computer Science, Southern Illinois University Edwardsville, Edwardsville, IL 62026, USA
Eren Gultepe: Department of Computer Science, Southern Illinois University Edwardsville, Edwardsville, IL 62026, USA

DOI: https://doi.org/10.3390/informatics10040076
Journal volume & issue: Vol. 10, no. 4, p. 76

Abstract

The evaluation of similarities between natural languages often relies on prior knowledge of the languages being studied. We describe three methods for building phylogenetic trees and clustering languages without the use of language-specific information. The input to our methods is a set of document vectors trained on a corpus of parallel translations of the Bible into 22 Indo-European languages, representing four language families: Indo-Iranian, Slavic, Germanic, and Romance. This text corpus consists of 532,092 Bible verses: the same 24,186 verses translated into each of the 22 languages. The methods are (A) hierarchical clustering using distance between language vector centroids, (B) hierarchical clustering using a network-derived distance measure, and (C) Deep Embedded Clustering (DEC) of language vectors. We evaluate our methods using a ground-truth tree and language families derived from that tree. All three achieve clustering F-scores above 0.9 on the Indo-Iranian and Slavic families; most confusion is between the Germanic and Romance families. The mean F-scores across all families are 0.864 (centroid clustering), 0.953 (network partitioning), and 0.763 (DEC). These results show that document vectors can be used to capture and compare linguistic features of multilingual texts, and thus could help extend language similarity and other translation studies research.
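
As a rough illustration of method (A), the sketch below averages each language's document vectors into a centroid and then merges the closest centroids agglomeratively. It is not the authors' implementation: the abstract does not state the distance metric or linkage rule, so cosine distance and average linkage are assumptions here, and the three-dimensional toy vectors and the helper names (`centroid`, `cosine_distance`, `average_linkage`) are hypothetical.

```python
from itertools import combinations
from math import sqrt


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """1 - cosine similarity (assumed metric; not stated in the abstract)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_linkage(centroids, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Cluster-to-cluster distance is the mean pairwise centroid distance
    (average linkage, an assumption made for this sketch).
    """
    clusters = [[name] for name in sorted(centroids)]

    def dist(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(cosine_distance(centroids[a], centroids[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical toy data: two "verse" vectors per language, three dimensions.
toy_vectors = {
    "german": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]],
    "dutch": [[0.9, 0.1, 0.1], [1.0, 0.2, 0.1]],
    "spanish": [[0.1, 1.0, 0.0], [0.2, 0.9, 0.1]],
    "italian": [[0.1, 0.9, 0.1], [0.2, 1.0, 0.1]],
}

centroids = {lang: centroid(vecs) for lang, vecs in toy_vectors.items()}
families = average_linkage(centroids, n_clusters=2)
print(families)
```

Recording the order of merges, rather than stopping at a fixed number of clusters, would give a tree-like structure of the kind that could be compared against a ground-truth tree.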

Published in Informatics

ISSN: 2227-9709 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://www.mdpi.com/journal/informatics
