Jisuanji kexue (Mar 2023)

Fine-grained Semantic Knowledge Graph Enhanced Chinese OOV Word Embedding Learning

  • CHEN Shurui, LIANG Ziran, RAO Yanghui

DOI
https://doi.org/10.11896/jsjkx.220700249
Journal volume & issue
Vol. 50, no. 3
pp. 72 – 82

Abstract

As informatization expands into more fields, text corpora in specific domains continue to appear. Because of security and sensitivity constraints, corpora in these domains (e.g., medical records and communication corpora) are often small. Traditional word embedding learning methods struggle to obtain high-quality embeddings on such corpora. Moreover, applying existing pre-trained language models directly may leave many out-of-vocabulary (OOV) words, so many words cannot be represented as vectors and performance on downstream tasks is limited. Many researchers have therefore begun to study how to infer the semantics of OOV words and obtain effective OOV word embeddings from fine-grained semantic information. However, current models that use fine-grained semantic information focus mainly on English corpora, and they model the relationships among fine-grained semantic information only through simple concatenation or mapping, which leads to poor robustness. To address these problems, this paper first proposes constructing a fine-grained knowledge graph by exploiting Chinese word formation rules, such as the characters contained in Chinese words and the components and pinyin of Chinese characters. The knowledge graph captures not only the relationship between Chinese characters and Chinese words, but also the multiple, complex relationships between pinyin and characters, between components and characters, and among other fine-grained semantic information. Next, a relational graph convolution operation is performed on the knowledge graph to model the deeper relationships between fine-grained semantics and word semantics. The method further mines the relationships among fine-grained semantics through sub-graph readout, so as to effectively infer the semantic information of Chinese OOV words. Experimental results show that the model achieves better performance on specific corpora with a large proportion of OOV words when applied to tasks such as word analogy, word similarity, text classification, and named entity recognition.
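
To make the graph construction step concrete, the following is a minimal sketch of how word, character, pinyin, and component nodes might be linked by typed edges, and how the fine-grained neighbourhood of an OOV word could be collected. It is not the authors' implementation: the relation names, the functions `build_graph` and `oov_subgraph`, and the toy lookup tables `PINYIN` and `COMPONENTS` are all assumptions made for illustration, and the relational graph convolution and learned readout described in the abstract are omitted.

```python
from collections import defaultdict

# Hypothetical toy lookup tables (assumption): a real system would use a
# pinyin dictionary and a character-decomposition resource instead.
PINYIN = {"医": "yi1", "院": "yuan4", "生": "sheng1", "日": "ri4"}
COMPONENTS = {"医": ["匚", "矢"], "院": ["阝", "完"]}


def build_graph(words):
    """Return typed edges as {relation_name: set of (head_node, tail_node)}.

    Nodes are (type, text) pairs so that a character and a component
    with the same glyph remain distinct nodes.
    """
    edges = defaultdict(set)
    for word in words:
        for ch in word:
            edges["word-char"].add((("word", word), ("char", ch)))
            if ch in PINYIN:
                edges["char-pinyin"].add((("char", ch), ("pinyin", PINYIN[ch])))
            for comp in COMPONENTS.get(ch, []):
                edges["char-component"].add((("char", ch), ("component", comp)))
    return edges


def oov_subgraph(edges, oov_word):
    """Collect the character nodes of an OOV word plus their pinyin and
    component neighbours already present in the graph."""
    chars = {("char", ch) for ch in oov_word}
    nodes = set(chars)
    for relation in ("char-pinyin", "char-component"):
        for head, tail in edges[relation]:
            if head in chars:
                nodes.add(tail)
    return nodes


if __name__ == "__main__":
    graph = build_graph(["医院", "生日"])
    # "医生" is not in the vocabulary, but both of its characters are.
    for node in sorted(oov_subgraph(graph, "医生")):
        print(node)
```

In this sketch the word 医生 never appears in the vocabulary, yet its sub-graph still contains the characters 医 and 生 together with their pinyin and components, which is the kind of fine-grained evidence a readout step could aggregate into an embedding.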

Keywords