Jisuanji kexue yu tansuo (Aug 2021)

Research of BERT Cross-Lingual Word Embedding Learning

  • WANG Yurong, LIN Min, LI Yanling

DOI
https://doi.org/10.3778/j.issn.1673-9418.2101042
Journal volume & issue
Vol. 15, no. 8
pp. 1405 – 1417

Abstract


With the growth of multilingual information on the Internet, effectively representing the information contained in texts written in different languages has become an important sub-task of natural language processing, and cross-lingual word embedding has therefore become a key technology. With the help of transfer learning, cross-lingual word embeddings map words from different languages into a shared low-dimensional space, allowing grammatical, semantic, and structural features to be transferred between languages and used to model cross-lingual semantic information. The BERT (bidirectional encoder representations from transformers) model obtains general word embeddings by training on large corpora and then dynamically optimizes them for specific downstream tasks to generate context-sensitive word embeddings, thus addressing the aggregation problem of previous models and yielding dynamic word embeddings. On the basis of a review of existing studies on BERT-based cross-lingual word embedding, this paper comprehensively describes the development of BERT-based cross-lingual word embedding learning methods, models, and techniques, as well as the training data they require. According to the training method, the approaches are divided into two categories, supervised learning and unsupervised learning, and representative research in each category is compared and summarized in detail. Finally, evaluation methods for cross-lingual word embedding are summarized, and prospects are given for the study of Mongolian-Chinese cross-lingual word embedding based on BERT.
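To make the shared-space mapping idea concrete, the following is a minimal, self-contained sketch (not taken from the paper) of the classic supervised alignment step that much of this literature builds on: fitting an orthogonal map from a source-language embedding matrix to a target-language one via the Procrustes solution over a seed bilingual dictionary. The embedding matrices below are random placeholders standing in for real (e.g. BERT-derived) word vectors, and all names and dimensions are illustrative assumptions.

```python
import numpy as np

# Toy supervised cross-lingual mapping via orthogonal Procrustes.
# Given source embeddings X and target embeddings Y for a seed
# dictionary of translation pairs, the optimal orthogonal map
# W* = argmin_{W orthogonal} ||X W - Y||_F is W* = U V^T,
# where X^T Y = U S V^T (singular value decomposition).
rng = np.random.default_rng(0)

d = 8          # embedding dimension (illustrative)
n_pairs = 100  # number of seed translation pairs (illustrative)

X = rng.normal(size=(n_pairs, d))                     # source-side word vectors
true_W = np.linalg.qr(rng.normal(size=(d, d)))[0]     # hidden orthogonal map
Y = X @ true_W + 0.01 * rng.normal(size=(n_pairs, d)) # noisy target-side vectors

# Procrustes solution: recover the mapping into the shared space.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

print("alignment error:", np.linalg.norm(X @ W - Y))
```

In a real pipeline the rows of X and Y would come from monolingual (or multilingual BERT) embeddings of dictionary word pairs, and the learned W would then be applied to the full source vocabulary so that nearest-neighbor search in the shared space yields translations; unsupervised variants replace the seed dictionary with adversarial or self-learning initialization.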

Keywords