Part-of-Speech Tagging with Rule-Based Data Preprocessing and Transformer

Hongwei Li; Hongyan Mao; Jingzi Wang

doi:10.3390/electronics11010056

Electronics (Dec 2021)

Affiliations

Hongwei Li: Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062, China
Hongyan Mao: Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062, China
Jingzi Wang: Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062, China

DOI: https://doi.org/10.3390/electronics11010056
Journal volume & issue: Vol. 11, no. 1, p. 56

Abstract

Part-of-Speech (POS) tagging is one of the most important tasks in natural language processing (NLP). The POS tag of a word depends not only on the word itself but also on its position, its surrounding words, and their POS tags. POS tagging often serves as an upstream task for other NLP tasks, so improving its accuracy can improve their performance as well. Bidirectional Long Short-Term Memory (Bi-LSTM) is commonly used for POS tagging and achieves good performance. However, Bi-LSTM is less powerful than Transformer at leveraging contextual information, since it simply concatenates the left-to-right and right-to-left contextual information. In this study, we propose a novel approach to improve the accuracy of POS tagging. For each token, all possible POS tags are first obtained without considering context, and rules are then applied to prune these candidate tags; we call this step rule-based data preprocessing. In this way, the number of possible POS tags for most tokens is reduced to one, and these tokens are considered correctly tagged. Finally, the POS tags of the remaining tokens are masked, and a Transformer-based model predicts only the masked POS tags, which enables it to leverage bidirectional context. Our experimental results show that our approach outperforms other methods that use Bi-LSTM.
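
The preprocessing pipeline described in the abstract (collect context-free candidate tags, prune them with rules, mask what remains ambiguous) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy lexicon, the tag names, the fallback tag `X`, the single pruning rule, and the function names `candidate_tags`, `apply_rules`, and `mask_ambiguous` are all hypothetical. The Transformer that would predict the masked positions is not shown.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon: every tag a word can take, ignoring context.
LEXICON: Dict[str, Set[str]] = {
    "the": {"DET"},
    "dog": {"NOUN"},
    "can": {"AUX", "NOUN", "VERB"},
    "run": {"VERB", "NOUN"},
}

MASK = "[MASK]"  # placeholder for tags left to a downstream model


def candidate_tags(tokens: List[str]) -> List[Set[str]]:
    """Look up all possible tags per token; unknown words get the assumed tag 'X'."""
    return [set(LEXICON.get(tok.lower(), {"X"})) for tok in tokens]


def apply_rules(cands: List[Set[str]]) -> List[Set[str]]:
    """Prune candidates with one illustrative rule (not a rule from the paper):
    a token directly after an unambiguous determiner is not a verb or auxiliary."""
    pruned = [set(c) for c in cands]
    for i in range(1, len(pruned)):
        if pruned[i - 1] == {"DET"}:
            remaining = pruned[i] - {"VERB", "AUX"}
            if remaining:  # never prune a token down to zero candidates
                pruned[i] = remaining
    return pruned


def mask_ambiguous(cands: List[Set[str]]) -> List[str]:
    """Keep the tag where exactly one candidate remains; mask every other position."""
    return [next(iter(c)) if len(c) == 1 else MASK for c in cands]


tokens = ["the", "dog", "can", "run"]
print(mask_ambiguous(apply_rules(candidate_tags(tokens))))
# ['DET', 'NOUN', '[MASK]', '[MASK]']
```

In this sketch, only the positions marked `[MASK]` would be passed to the tagger for prediction, while the resolved tags on both sides serve as context.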

Published in Electronics

ISSN: 2079-9292 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics
Website: http://www.mdpi.com/journal/electronics