IEEE Access (Jan 2024)

Low-Resource POS Tagging With Deep Affix Representation and Multi-Head Attention

  • Alim Murat,
  • Samat Ali

DOI
https://doi.org/10.1109/ACCESS.2024.3395454
Journal volume & issue
Vol. 12
pp. 66495 – 66504

Abstract

Part-of-speech (POS) tagging is a challenging and foundational task in natural language processing (NLP). It commonly relies on learned representations of individual words and of the characters within those words. However, neither of these representations explicitly captures the semantics of sub-word units such as roots, stems, and affixes, particularly in low-resource languages with rich morphology. This limitation leads to numerous unknown words and ambiguities in the POS tagging task for agglutinative languages. In this work, a deep representation approach for word prefixes and suffixes is introduced, using a character n-gram approximation method to further augment features at both the word and character levels. Then, a multi-head attention mechanism is applied to capture contextual dependencies among words, which can effectively resolve POS tag ambiguity. Finally, a customized dataset named MultiPOS_ukg is created for the Uyghur, Uzbek, and Kyrgyz languages according to uniform tag sets. Empirically, the proposed method is tested on both the customized dataset and the METU Turkish Treebank dataset. The overall performance shows a significant improvement in POS tagging accuracy, with increases of up to 5.36%, 4.13%, and 2.1% across three morphologically rich languages (MRLs). This improvement is achieved by incorporating affix-based word representation and multi-head attention, and the resulting model surpasses all other word- and character-based models.
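
The idea of approximating prefixes and suffixes with character n-grams can be illustrated with a minimal sketch. It assumes that an affix candidate is simply the first or last n characters of a word, for n up to a small maximum. The function name `affix_ngrams` and the parameter `max_n` are hypothetical and do not come from the paper, which may define the approximation differently and feeds the resulting units into learned embeddings that are not shown here.

```python
def affix_ngrams(word, max_n=3):
    """Return (prefixes, suffixes) as leading/trailing character n-grams.

    Assumption for illustration only: an affix candidate is the first or
    last n characters of the word, for n = 1 .. max_n, capped at the
    word length. This is not the paper's confirmed procedure.
    """
    limit = min(max_n, len(word))
    # Leading n-grams approximate prefixes.
    prefixes = [word[:n] for n in range(1, limit + 1)]
    # Trailing n-grams approximate suffixes.
    suffixes = [word[-n:] for n in range(1, limit + 1)]
    return prefixes, suffixes


# Uzbek "kitoblar" ("books") = kitob + plural suffix -lar
prefixes, suffixes = affix_ngrams("kitoblar", max_n=3)
print(prefixes)  # ['k', 'ki', 'kit']
print(suffixes)  # ['r', 'ar', 'lar']
```

In this example the trailing 3-gram recovers the plural suffix "-lar" without a morphological analyzer, which is the kind of signal an affix-level representation could expose to a tagger in a low-resource setting.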

Keywords