IEEE Access (Jan 2019)
Joint Fine-Grained Components Continuously Enhance Chinese Word Embeddings
Abstract
The most common approach to word embedding is to learn word vector representations from the context information of large-scale text. However, Chinese words are typically composed of characters, subcharacters, and strokes, each of which carries rich semantic information. The quality of Chinese word vectors directly affects the accuracy of downstream prediction tasks. Therefore, to obtain high-quality Chinese word embeddings, we propose a continuously enhanced word embedding model. The model starts from fine-grained strokes and adjacent-stroke information and enhances subcharacter embeddings by incorporating vector representations of the relationships between strokes. Similarly, building on the enhanced subcharacter embeddings, we combine the subcharacter relationship vectors and the character relationship vectors to learn Chinese word embeddings. We construct the underlying stroke n-grams and adjacent stroke n-grams and extract the relationship vectors that strengthen the connections between components, which are then used to learn Chinese word embeddings and improve accuracy. Finally, we evaluate our model on word similarity and word analogy (reasoning) tasks.
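As a minimal illustration of the stroke n-gram construction described above, the sketch below enumerates all stroke n-grams over a word's stroke sequence. The stroke IDs and the n-gram window (3 to 5, as commonly used in stroke-based embedding work such as cw2vec) are illustrative assumptions, not details taken from this paper.

```python
def stroke_ngrams(strokes, n_min=3, n_max=5):
    """Return all contiguous stroke n-grams of length n_min..n_max.

    Strokes are encoded with the common 5-category scheme
    (1=horizontal, 2=vertical, 3=left-falling, 4=right-falling,
    5=turning); this encoding is an assumption for illustration.
    """
    ngrams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(strokes) - n + 1):
            ngrams.append(tuple(strokes[i:i + n]))
    return ngrams

# Hypothetical stroke sequence of a short word
strokes = [1, 2, 1, 5, 4]
print(stroke_ngrams(strokes, 3, 4))
# → [(1, 2, 1), (2, 1, 5), (1, 5, 4), (1, 2, 1, 5), (2, 1, 5, 4)]
```

Each n-gram would then be assigned its own vector; in the model described, such fine-grained vectors are combined upward to enhance subcharacter, character, and finally word embeddings.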
Keywords