Applied Sciences (Nov 2021)
Efficient Estimate of Low-Frequency Words’ Embeddings Based on the Dictionary: A Case Study on Chinese
Abstract
Obtaining high-quality embeddings of out-of-vocabularies (OOVs) and low-frequency words is a challenge in natural language processing (NLP). To efficiently estimate the embeddings of OOVs and low-frequency words, we propose a new method that uses the dictionary to estimate the embeddings of OOVs and low-frequency words. More specifically, the explanatory note of an entry in dictionaries accurately describes the semantics of the corresponding word. Naturally, we adopt the sentence representation model to extract the semantics of the explanatory note and regard the semantics as the embedding of the corresponding word. We design a new sentence representation model to encode sentences to extract the semantics from the explanatory notes of entries more efficiently. Based on the assumption that the higher quality of word embeddings will lead to better performance, we design an extrinsic experiment to evaluate the quality of low-frequency words’ embeddings. The experimental results show that the embeddings of low-frequency words estimated by our proposed method have higher quality. In addition, both intrinsic and extrinsic experiments show that our proposed sentence representation model can represent the semantics of sentences well.
Keywords