Applied Sciences (Jul 2021)
English–Welsh Cross-Lingual Embeddings
Abstract
Cross-lingual embeddings are vector space representations in which word translations tend to be co-located. These representations enable learning transfer across languages, thus bridging the gap between data-rich languages such as English and less-resourced languages. In this paper, we present and evaluate a suite of cross-lingual embeddings for the English–Welsh language pair. To train the bilingual embeddings, a Welsh corpus of approximately 145 M words was combined with an English Wikipedia corpus. We used a bilingual dictionary to frame the problem of learning bilingual mappings as a supervised machine learning task, in which a word vector space is first learned independently on a monolingual corpus, after which a linear alignment strategy is applied to map the monolingual embeddings to a common bilingual vector space. Two approaches were used to learn monolingual embeddings: word2vec and fastText. Three cross-language alignment strategies were explored: cosine similarity, inverted softmax and cross-domain similarity local scaling (CSLS). We evaluated different combinations of these approaches using two tasks: bilingual dictionary induction and cross-lingual sentiment analysis. The best results were achieved using monolingual fastText embeddings and the CSLS metric. We also demonstrated that by including a few automatically translated training documents, the performance of a cross-lingual text classifier for Welsh can increase by approximately 20 percentage points.
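To make the supervised alignment pipeline described above concrete, the following is a minimal sketch, not the authors' implementation: it learns an orthogonal linear map between two pre-trained monolingual embedding matrices from a seed dictionary (the standard closed-form Procrustes solution) and then retrieves translations with CSLS. All function names and the toy random data are illustrative assumptions; only NumPy is required.

```python
import numpy as np

def procrustes_align(X, Y):
    """Learn an orthogonal map W minimising ||XW - Y||_F, where the rows of
    X and Y are embeddings of seed-dictionary translation pairs.
    Closed-form orthogonal Procrustes solution via SVD."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def csls_scores(mapped_src, tgt, k=10):
    """Cross-domain similarity local scaling (CSLS): discount each cosine
    similarity by the mean similarity of both words to their k nearest
    neighbours in the other language, which mitigates hubness."""
    # Normalise rows so dot products are cosine similarities.
    s = mapped_src / np.linalg.norm(mapped_src, axis=1, keepdims=True)
    t = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = s @ t.T                                  # pairwise cosine similarities
    r_src = np.sort(sims, axis=1)[:, -k:].mean(1)   # mean sim of each source word's k-NN targets
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(0)   # mean sim of each target word's k-NN sources
    return 2 * sims - r_src[:, None] - r_tgt[None, :]

# Toy usage: random vectors stand in for real English/Welsh embeddings.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))   # e.g. English vectors for dictionary entries
Y = rng.standard_normal((100, 50))   # e.g. their Welsh translations
W = procrustes_align(X, Y)
translations = csls_scores(X @ W, Y).argmax(axis=1)  # induced translation per source word
```

With real fastText vectors in place of the random matrices, `translations` would give the bilingual-dictionary-induction output evaluated in the paper; swapping `csls_scores` for plain cosine or inverted softmax reproduces the other two retrieval criteria compared above.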
Keywords