Clustering of Monolingual Embedding Spaces

Kowshik Bhowmik; Anca Ralescu

doi:10.3390/digital3010004

Digital (Feb 2023)

Affiliations

Kowshik Bhowmik: The College of Wooster, Mathematical and Computational Sciences, Wooster, OH 44691, USA
Anca Ralescu: Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221, USA

DOI: https://doi.org/10.3390/digital3010004
Journal volume & issue: Vol. 3, no. 1, pp. 48–66

Abstract

Cross-lingual word embeddings perform suboptimally for distant and low-resource languages, which calls into question the isomorphism assumption integral to mapping-based methods of obtaining such embeddings. This paper investigates the comparative impact of typological relationship and corpus size on the isomorphism between monolingual embedding spaces. To that end, two clustering algorithms were applied to three sets of pairwise degrees of isomorphism. The paper also aims to determine which combination of isomorphism measure and clustering algorithm best captures the typological relationships among the chosen set of languages. Of the three measures investigated, Relational Similarity appeared to best capture the typological information of the languages encoded in their respective embedding spaces. These language clusters can help identify the related higher-resource languages of low-resource languages without any pre-existing knowledge of the real-world linguistic relationships within a group of languages. Including such related languages in a cross-lingual embedding space can help improve the performance of low-resource languages in that space.
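
The general idea of clustering languages from pairwise isomorphism scores can be illustrated with a minimal sketch. The abstract does not name the two clustering algorithms or give any scores, so everything here is an assumption made for illustration: the use of average-linkage agglomerative clustering, the function name average_linkage_clusters, the placeholder language labels, and the toy distance values (lower meaning "more isomorphic"). It is not the authors' method or data.

```python
from itertools import combinations


def average_linkage_clusters(names, dist, k):
    """Merge the two closest clusters until only k remain.

    names: list of language labels (hypothetical placeholders here).
    dist:  symmetric matrix of pairwise distances between embedding spaces.
    k:     number of clusters to stop at.
    """
    clusters = [[i] for i in range(len(names))]

    def linkage(a, b):
        # Average pairwise distance between members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # combinations() yields index pairs (a, b) with a < b.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    return [sorted(names[i] for i in c) for c in clusters]


# Toy, made-up distances: two placeholder "families" of two languages each.
names = ["lang_a1", "lang_a2", "lang_b1", "lang_b2"]
dist = [
    [0.0, 0.1, 0.8, 0.8],
    [0.1, 0.0, 0.8, 0.8],
    [0.8, 0.8, 0.0, 0.2],
    [0.8, 0.8, 0.2, 0.0],
]

result = average_linkage_clusters(names, dist, k=2)
print(result)
```

With these toy values the two low-distance pairs end up in separate clusters, which mirrors the intended use described above: a low-resource language is grouped with the languages whose embedding spaces are most similar to its own, without consulting any external typological knowledge.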

Published in Digital

ISSN: 2673-6470 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.mdpi.com/journal/digital
