International Journal of Information Science and Management (Jan 2023)

Training vs Post-training Cross-lingual Word Embedding Approaches: A Comparative Study

  • Masood Ghayoomi

DOI
https://doi.org/10.22034/ijism.2022.1977779.0
Journal volume & issue
Vol. 21, no. 1
pp. 163 – 182

Abstract

This paper provides a comparative analysis of cross-lingual word embeddings by studying the impact of different variables on the quality of the embedding models within the distributional semantics framework. Distributional semantics is a method for the semantic representation of words, phrases, sentences, and documents; it aims to capture as much contextual information as possible in a vector space. Early studies in this domain focused on monolingual word embeddings. Later work used cross-lingual data to capture contextual semantic information across languages. The main contribution of this research is a comparative study of how the learning method (supervised or unsupervised) and the approach (training or post-training) of different embedding algorithms affect how well cross-lingual embedding models capture the semantic properties of words, with a view to tasks that involve multiple languages, such as question retrieval. To this end, we study the cross-lingual embedding models created by the BilBOWA, VecMap, and MUSE embedding algorithms, along with two variables that affect the quality of the embedding models: the size of the training data and the window size of the local context. We use the unsupervised monolingual Word2Vec embedding model as the baseline and evaluate the quality of the embeddings on three data sets: the Google analogy data set and mono- and cross-lingual word similarity lists. We further investigate the impact of the embedding models on the question retrieval task.

Keywords