Jisuanji Kexue yu Tansuo / Journal of Frontiers of Computer Science and Technology (Jan 2024)

Word Embedding Methods in Natural Language Processing: A Review

  • ZENG Jun, WANG Ziwei, YU Yang, WEN Junhao, GAO Min

DOI
https://doi.org/10.3778/j.issn.1673-9418.2303056
Journal volume & issue
Vol. 18, no. 1
pp. 24 – 43

Abstract

Word embedding, the first step in natural language processing (NLP) tasks, aims to transform input natural language text into numerical vectors, known as word vectors or distributed representations, which artificial intelligence models can process. Word vectors, the foundation of NLP, are a prerequisite for accomplishing various downstream NLP tasks. However, most existing review literature on word embedding focuses on the technical routes of different embedding methods, neglecting a comprehensive analysis of tokenization methods and of the complete evolutionary trajectory of word embedding. Taking the introduction of the word2vec model and the Transformer model as pivotal points, and from the perspective of whether generated word vectors can dynamically adjust their implicit semantic information to fit the overall semantics of the input sentence, this paper categorizes word embedding methods into static and dynamic approaches and discusses this classification in depth. It also compares and analyzes tokenization methods used in word embedding, covering both whole-word and subword segmentation. This paper further details the evolution of the language models used to train word vectors, progressing from probabilistic language models to neural probabilistic language models and on to the current deep contextual language models, and it summarizes the training strategies employed in pre-training language models. Finally, this paper concludes with a summary of methods for evaluating word vector quality, an analysis of the current state of word embedding methods, and an outlook on their future development.
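The static/dynamic distinction the abstract draws can be illustrated with a minimal sketch (using hypothetical toy vectors, not any model from the paper): a static embedding assigns each word one fixed vector regardless of sentence context, which is precisely the limitation that dynamic (contextual) embeddings address.

```python
# Toy static embedding table (hypothetical 3-dimensional vectors).
embeddings = {
    "bank":  [0.2, 0.7, 0.1],
    "river": [0.9, 0.1, 0.3],
    "money": [0.1, 0.8, 0.6],
}

def embed(sentence):
    """Map a whitespace-tokenized sentence to its list of word vectors."""
    return [embeddings[word] for word in sentence.split()]

# With a static embedding, "bank" receives the same vector in both
# sentences, even though its sense differs ("riverbank" vs. "money bank").
v1 = embed("river bank")[1]
v2 = embed("money bank")[1]
assert v1 == v2  # identical vectors -> context is not reflected
```

A dynamic method (e.g., a Transformer-based contextual model) would instead produce different vectors for "bank" in the two sentences, conditioned on the surrounding words.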

Keywords