Neural POS tagging of Shahmukhi by using contextualized word representations

Amina Tehseen; Toqeer Ehsan; Hannan Bin Liaqat; Amjad Ali; Ala Al-Fuqaha

Journal of King Saud University: Computer and Information Sciences (Jan 2023)

Affiliations

Amina Tehseen: Department of Information Technology, University of Gujrat, Gujrat 50700, Pakistan
Toqeer Ehsan: Department of Computer Science, University of Gujrat, Gujrat 50700, Pakistan
Hannan Bin Liaqat: Department of Information Technology, Division of Science & Technology, University of Education, Township Campus, Lahore 54000, Pakistan
Amjad Ali: Information and Computing Technology (ICT) Division, College of Science and Engineering (CSE), Hamad Bin Khalifa University, Doha, Qatar
Ala Al-Fuqaha: Information and Computing Technology (ICT) Division, College of Science and Engineering (CSE), Hamad Bin Khalifa University, Doha, Qatar; Corresponding author.

Journal volume & issue: Vol. 35, no. 1
pp. 335 – 356

Abstract

Part-of-speech (POS) tagging is a preliminary step in building natural language processing applications. This paper presents the development and evaluation of the first POS-tagged corpus of this scale for Shahmukhi (Western Punjabi), along with a POS tagger based on a bidirectional long short-term memory (BiLSTM) network. A balanced corpus of 0.13 million words, drawn from 14 different text domains, has been annotated. A Shahmukhi POS tagset has been devised by studying the applicability of the CLE Urdu POS tagset, and tagging guidelines have been designed for annotation. A multi-step evaluation process has been applied to the tagged corpus, including grammar-based and n-gram-based consistency evaluations. The average inter-annotator agreement across all domains is 95.35%, with an average Kappa coefficient of 0.94. The performance of the BiLSTM POS tagger has been compared with the well-known language-independent TreeTagger and the Stanford POS tagger. The accuracy of the tagger has been further improved through transfer learning, by training context-free (Word2Vec) and contextualized (ELMo) word representations on a corpus of 14.9 million Shahmukhi words collected from the World Wide Web. The tagger achieved an F-score of 96.11 and an accuracy of 96.12%. For a morphologically rich and low-resource language, these POS tagging results are quite promising.
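The abstract reports both a raw inter-annotator agreement and a Kappa coefficient. As an illustration of how the two figures relate, the sketch below computes observed agreement and Cohen's kappa for two annotators' tag sequences. It assumes the two-annotator (Cohen) variant, which the abstract does not specify, and the function name and the sample tags are hypothetical, not taken from the paper's tagset or data.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Return (observed agreement, Cohen's kappa) for two tag sequences.

    Assumption: exactly two annotators labelled the same tokens in the
    same order. The paper may have used a different kappa variant.
    """
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("tag sequences must be non-empty and equally long")

    n = len(tags_a)

    # Observed agreement: fraction of tokens given the same tag by both.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n

    # Chance agreement: for each tag, the product of the two annotators'
    # marginal probabilities of using it, summed over all tags.
    count_a = Counter(tags_a)
    count_b = Counter(tags_b)
    expected = sum(count_a[t] * count_b[t] for t in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical tag everywhere.
        return observed, 1.0

    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical example: four tokens, one disagreement.
annotator_1 = ["NN", "VB", "NN", "PSP"]
annotator_2 = ["NN", "VB", "JJ", "PSP"]
agreement, kappa = cohen_kappa(annotator_1, annotator_2)
print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")
```

Kappa is lower than raw agreement because it discounts the matches that two annotators would produce by chance given how often each uses every tag.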

Published in Journal of King Saud University: Computer and Information Sciences

ISSN: 1319-1578 (Print)
Publisher: Elsevier
Country of publisher: Saudi Arabia
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://www.journals.elsevier.com/journal-of-king-saud-university-computer-and-information-sciences/
