IEEE Access (Jan 2024)

A Deep Learning-Based Approach for Part of Speech (PoS) Tagging in the Pashto Language

  • Shaheen Ullah,
  • Riaz Ahmad,
  • Abdallah Namoun,
  • Siraj Muhammad,
  • Khalil Ullah,
  • Ibrar Hussain,
  • Isa Ali Ibrahim

DOI
https://doi.org/10.1109/ACCESS.2024.3412175
Journal volume & issue
Vol. 12
pp. 86355 – 86364

Abstract

Part of speech (PoS) tagging is a fundamental task in natural language processing (NLP). It is crucial to many NLP applications, including question answering, machine translation, syntactic parsing, speech recognition, and semantic parsing. PoS tagging is a sequence labeling task in which a tagger assigns each word its appropriate part of speech label. In NLP, PoS tagging is often considered a language-specific task, and Pashto is a language that has not yet been explored in this regard. This research therefore focuses on PoS tagging for the Pashto language and provides a baseline accuracy. Its contribution is twofold. First, it introduces a tagged Pashto dataset containing 281,205 words, each labeled with one of 17 unique PoS tags. Second, it proposes a deep learning-based model by examining classic Recurrent Neural Networks (RNN) and Bidirectional Long Short-Term Memory networks (BLSTM). The results show promising performance when the models are used with word embeddings. The proposed approach achieved a baseline accuracy of 98.82% on the test dataset using the BLSTM model with word embeddings.

Keywords