Cross-lingual transfer of sentiment classifiers

Marko Robnik-Šikonja; Kristjan Reba; Igor Mozetič

doi:10.4312/slo2.0.2021.1.1-25

Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave (Jul 2021)


Affiliations

Marko Robnik-Šikonja: University of Ljubljana, Faculty of Computer and Information Science, Slovenia
Kristjan Reba: University of Ljubljana, Faculty of Computer and Information Science, Slovenia
Igor Mozetič: Jožef Stefan Institute, Ljubljana, Slovenia

Journal volume & issue: Vol. 9, no. 1

Abstract

Word embeddings represent words as vectors in a numeric space so that semantic relations between words correspond to distances and directions in that space. Cross-lingual word embeddings transform the vector spaces of different languages so that similar words are aligned. This is done either by mapping one language’s vector space to that of another language or by constructing a joint vector space for multiple languages. Cross-lingual embeddings can be used to transfer machine learning models between languages, thereby compensating for insufficient data in less-resourced languages. We use cross-lingual word embeddings to transfer machine learning prediction models for Twitter sentiment between 13 languages. We focus on two transfer mechanisms that have recently shown superior transfer performance. The first mechanism uses trained models whose input is a joint numerical space for many languages, as implemented in the LASER library. The second mechanism uses large pretrained multilingual BERT language models. Our experiments show that the transfer of models between similar languages is sensible, even with no target language data. The performance of cross-lingual models obtained with multilingual BERT and with the LASER library is comparable, and the differences are language-dependent. The transfer with CroSloEngual BERT, pretrained on only three languages, is superior on these three languages and on some closely related languages.
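The idea of mapping one language's vector space onto another's can be illustrated with a small sketch. It is not the method evaluated in the article, which relies on the LASER library and multilingual BERT models. It assumes instead a classical orthogonal Procrustes alignment, and it runs on synthetic vectors rather than real embeddings. The function name `fit_orthogonal_map` and the toy dimensions are hypothetical choices made for illustration only.

```python
import numpy as np


def fit_orthogonal_map(src, tgt):
    """Return the orthogonal matrix W that minimises ||src @ W - tgt||_F.

    src, tgt: arrays of shape (n_pairs, dim); row i of src and row i of tgt
    are assumed to be embeddings of a translation pair.
    """
    # Orthogonal Procrustes solution: SVD of the cross-covariance matrix.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


# Synthetic stand-in for a bilingual dictionary (assumption: no real data).
rng = np.random.default_rng(0)
tgt = rng.normal(size=(50, 8))                      # toy "target language" vectors
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # hidden orthogonal transform
src = tgt @ rotation.T                               # toy "source language" vectors

W = fit_orthogonal_map(src, tgt)
aligned = src @ W  # source vectors expressed in the target space

# Largest remaining gap between mapped source vectors and their targets.
max_error = float(np.abs(aligned - tgt).max())
```

In this toy setting the source space is an exact rotation of the target space, so the learned map recovers the alignment up to floating-point error. With real embeddings the fit is only approximate, and a classifier trained on target-space vectors would then be applied to the mapped source vectors.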

Published in Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave

ISSN: 2335-2736 (Online)
Publisher: University of Ljubljana Press (Založba Univerze v Ljubljani)
Country of publisher: Slovenia
LCC subjects: Language and Literature: Philology. Linguistics
Website: https://journals.uni-lj.si/slovenscina2
