Exploring language relations through syntactic distances and geographic proximity

Juan De Gregorio; Raúl Toral; David Sánchez

doi:10.1140/epjds/s13688-024-00498-7

EPJ Data Science (Sep 2024)

Affiliations

Juan De Gregorio: Institute for Cross-Disciplinary Physics and Complex Systems IFISC (UIB-CSIC), Campus Universitat de les Illes Balears
Raúl Toral: Institute for Cross-Disciplinary Physics and Complex Systems IFISC (UIB-CSIC), Campus Universitat de les Illes Balears
David Sánchez: Institute for Cross-Disciplinary Physics and Complex Systems IFISC (UIB-CSIC), Campus Universitat de les Illes Balears

Journal volume & issue: Vol. 13, no. 1, pp. 1–27

Abstract

Languages are grouped into families that share common linguistic traits. While this approach has been successful in understanding genetic relations between diverse languages, more analyses are needed to accurately quantify their relatedness, especially in less studied linguistic levels such as syntax. Here, we explore linguistic distances using series of parts of speech (POS) extracted from the Universal Dependencies dataset. Within an information-theoretic framework, we show that employing POS trigrams maximizes the possibility of capturing syntactic variations while being at the same time compatible with the amount of available data. Linguistic connections are then established by assessing pairwise distances based on the POS distributions. Intriguingly, our analysis reveals definite clusters that correspond to well-known language families and groups, with exceptions explained by distinct morphological typologies. Furthermore, we obtain a significant correlation between language similarity and geographic distance, which underscores the influence of spatial proximity on language kinships.
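
As a rough illustration of the pipeline the abstract outlines (POS trigrams, then pairwise distances between the resulting distributions), the sketch below may help. It is not the authors' code. The abstract does not say which distance measure is used, so the Jensen-Shannon distance here is an assumption, and the function names and the toy tag sequences are hypothetical. The tags follow the Universal Dependencies UPOS labels.

```python
from collections import Counter
from math import log2, sqrt


def pos_trigram_distribution(sentences):
    """Relative frequencies of POS trigrams over a list of tagged sentences."""
    counts = Counter()
    for tags in sentences:
        # Slide a window of three tags; trigrams do not cross sentence boundaries.
        for i in range(len(tags) - 2):
            counts[tuple(tags[i:i + 3])] += 1
    total = sum(counts.values())
    return {tri: n / total for tri, n in counts.items()} if total else {}


def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence (assumed measure)."""
    divergence = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution
        if pk > 0:
            divergence += 0.5 * pk * log2(pk / mk)
        if qk > 0:
            divergence += 0.5 * qk * log2(qk / mk)
    # Guard against tiny negative values caused by floating-point rounding.
    return sqrt(max(divergence, 0.0))


# Toy, made-up tag sequences standing in for two languages' corpora.
lang_a = [["DET", "NOUN", "VERB", "DET", "NOUN"], ["PRON", "VERB", "ADJ"]]
lang_b = [["NOUN", "DET", "VERB", "NOUN", "DET"], ["PRON", "ADJ", "VERB"]]

dist = jensen_shannon_distance(
    pos_trigram_distribution(lang_a),
    pos_trigram_distribution(lang_b),
)
print(f"Toy syntactic distance: {dist:.3f}")
```

With base-2 logarithms this distance lies between 0 (identical trigram distributions) and 1 (no shared trigrams), which makes values comparable across language pairs. Real corpora would be read from the Universal Dependencies treebanks rather than typed in by hand.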

Published in EPJ Data Science

ISSN: 2193-1127 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: http://www.epjdatascience.com/
