Jisuanji kexue (Nov 2022)
Incorporating Part of Speech and Tonal Features for Vietnamese Grammatical Error Detection
Abstract
The BERT pre-trained language model removes the tones of the syllables when segmenting Vietnamese words,which leads to the loss of some semantic information during the training process of grammatical error detection model.To address this problem,an approach combining part of speech and tonal features is proposed to complete the semantic information of the input syllables.Grammatical error detection task confronts the problem of insufficient training data due to the scarcity of labeled Vietnamese data.To address this problem,a data augmentation algorithm is designed to generate a large number of error texts from the correct corpus.Experimental results on Vietnamese Wikipedia and news corpus show that the proposed method achieves the highest F0.5 and F1 score on the test set,which proves it improves the detection performance.Both the proposed method and the baseline model method have a gradual improvement with the increasing scales of the generated data,which proves that the proposed data augmentation algorithm is effective.
Keywords