Applied Sciences (Dec 2024)
Word Vector Representation of Latin Cuengh Based on Root Feature Enhancement
Abstract
Latin Cuengh is a language used in China's ethnic minority areas. Because of its complex pronunciation and semantic system, it is difficult to disseminate widely. To process and help protect this language, this paper studies it using current word vector representation technology. Word vector representation is a basic method and an important foundation of current natural language processing research. It usually relies on large amounts of data and is obtained through the paradigm of pre-training and feature learning. Because Latin Cuengh corpus resources are extremely scarce, it is very difficult to obtain word vectors through large-scale data training. In this study, we propose a word vector representation method that incorporates the root features of Latin Cuengh words. Specifically, while learning from the Latin Cuengh corpus, the method uses the distinctive word roots of the language to modify the training process, which enhances the expression of root features. Following the masking approach of BERT, the method masks word roots after word segmentation and predicts the masked roots at the output layer of the model, yielding a better vector representation of Latin Cuengh words. The experimental results show that the proposed method is effective and can express Latin Cuengh semantics. Its word-semantic accuracy is nearly 2 percentage points higher than that of the BERT representation, and its judgments of semantic similarity between words are more accurate.
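The root-masking step described above can be illustrated with a minimal sketch. It is not the implementation used in this paper: the function name `mask_roots`, the placeholder tokens, the root set, and the masking probability are all hypothetical, and the sketch only shows how masking can be restricted to root tokens so that a BERT-style model is trained to predict them.

```python
import random

MASK = "[MASK]"  # mask symbol in the style of BERT; assumed for illustration


def mask_roots(tokens, roots, mask_prob=0.15, seed=None):
    """Mask only tokens that are word roots (hypothetical helper).

    Returns (masked_tokens, labels). labels[i] holds the original token at
    each masked position and None elsewhere, so a prediction loss would be
    computed on the masked roots only.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        # Non-root tokens are never masked; roots are masked with mask_prob.
        if tok in roots and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Placeholder segmentation output, not real Latin Cuengh data.
tokens = ["root_a", "affix_x", "root_b", "affix_y"]
roots = {"root_a", "root_b"}

# mask_prob=1.0 masks every root, which makes the behaviour easy to inspect.
masked, labels = mask_roots(tokens, roots, mask_prob=1.0, seed=0)
print(masked)
print(labels)
```

In a full pipeline, the masked sequence would be fed to the model and the labels used as prediction targets at the output layer. The choice of masking probability and the way roots are identified after segmentation are left open here.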
Keywords