Efficient Estimate of Low-Frequency Words’ Embeddings Based on the Dictionary: A Case Study on Chinese

Xianwen Liao; Yongzhong Huang; Changfu Wei; Chenhao Zhang; Yongqing Deng; Ke Yi

doi:10.3390/app112211018

Applied Sciences (Nov 2021)

Affiliations

Xianwen Liao: School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
Yongzhong Huang: School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
Changfu Wei: School of Southeast Asian Studies, Guangxi University for Nationalities, Nanning 530006, China
Chenhao Zhang: School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
Yongqing Deng: School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
Ke Yi: College of Foreign Studies, Guilin University of Electronic Technology, Guilin 541004, China

DOI: https://doi.org/10.3390/app112211018
Journal volume & issue: Vol. 11, no. 22, p. 11018

Abstract

Obtaining high-quality embeddings for out-of-vocabulary (OOV) and low-frequency words is a challenge in natural language processing (NLP). To estimate these embeddings efficiently, we propose a new method based on the dictionary. The explanatory note of a dictionary entry accurately describes the semantics of the corresponding word, so we use a sentence representation model to extract the semantics of the explanatory note and take the result as the embedding of that word. To extract these semantics more efficiently, we design a new sentence representation model. On the assumption that higher-quality word embeddings lead to better performance, we design an extrinsic experiment to evaluate the quality of low-frequency words’ embeddings. The experimental results show that the embeddings of low-frequency words estimated by our proposed method are of higher quality. In addition, both intrinsic and extrinsic experiments show that our proposed sentence representation model represents the semantics of sentences well.
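
The core idea of the abstract, encoding a dictionary entry's explanatory note and using the result as the word's embedding, can be illustrated with a minimal Python sketch. It is not the authors' model: the sentence encoder here is a plain average of known word vectors, swapped in as a stand-in for the paper's sentence representation model, and the toy vectors, the `DICTIONARY` entry, and the function names are all hypothetical.

```python
from math import sqrt

# Toy pretrained embeddings for frequent words (hypothetical values).
EMBEDDINGS = {
    "young": [0.8, 0.2, 0.1],
    "small": [0.9, 0.1, 0.0],
    "domestic": [0.2, 0.8, 0.1],
    "cat": [0.1, 0.9, 0.3],
}

# Toy dictionary mapping an entry to its explanatory note (hypothetical).
DICTIONARY = {"kitten": "a young small domestic cat"}


def encode_sentence(sentence, embeddings):
    """Stand-in sentence encoder: mean of the vectors of known words.

    Assumption: the paper uses a learned sentence representation model;
    averaging is only the simplest possible substitute.
    """
    vectors = [embeddings[t] for t in sentence.lower().split() if t in embeddings]
    if not vectors:
        raise ValueError("explanatory note contains no known words")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def estimate_embedding(word, dictionary, embeddings):
    """Estimate a word's embedding from its dictionary explanatory note."""
    return encode_sentence(dictionary[word], embeddings)


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


# "kitten" has no entry in EMBEDDINGS, so it is estimated from its note.
kitten = estimate_embedding("kitten", DICTIONARY, EMBEDDINGS)
similarity_to_cat = cosine(kitten, EMBEDDINGS["cat"])
```

Here `kitten` has no pretrained vector, so it is treated as an OOV word and its embedding is derived entirely from its explanatory note. Replacing `encode_sentence` with a stronger sentence encoder leaves the rest of the pipeline unchanged.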

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
