A conditional random field based approach for high-accuracy part-of-speech tagging using language-independent features

Mushtaq Ali; Muzammil Khan; Yasser Alharbi

doi:10.7717/peerj-cs.2577

PeerJ Computer Science (Dec 2024)

Affiliations

Mushtaq Ali: Department of Computer and Software Technology, University of Swat, Swat, KP, Pakistan
Muzammil Khan: Department of Computer and Software Technology, University of Swat, Swat, KP, Pakistan
Yasser Alharbi: College of Computer Science and Engineering, University of Hail, Hail, Saudi Arabia

DOI: https://doi.org/10.7717/peerj-cs.2577
Journal volume & issue: Vol. 10, p. e2577

Abstract

Part-of-speech (POS) tagging is the process of assigning a tag or label to each word of a text according to its grammatical category. It makes the grammatical structure of a text accessible and plays an important role in many natural language processing tasks, such as syntax understanding, semantic analysis, text processing, information retrieval, machine translation, and named entity recognition. Because POS tagging is sequential, depends on context, and labels every word, it is a sequence labeling task. Urdu text processing faces several challenges, including resource scarcity, morphological richness, free word order, absence of capitalization, agglutinative nature, spelling variations, and multipurpose usage of words, which raise the demand for machine learning based automatic POS tagging systems for Urdu. Therefore, a conditional random field (CRF) based supervised POS classifier has been developed for 33 different Urdu POS categories using language-independent features of Urdu text. It is trained and evaluated on the Urdu news dataset MM-POST, which contains 119,276 tokens from seven different domains: Entertainment, Finance, General, Health, Politics, Science, and Sports. An analysis of the proposed approach is presented, showing that it outperforms other Urdu POS tagging research while using a simpler strategy that employs fewer word-level features, namely context windows together with the word length. The effective use of these features for the POS tagging of Urdu text resulted in state-of-the-art performance of the CRF model, with an overall classification accuracy of 96.1%.
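The feature strategy described in the abstract (neighbouring words in a context window plus the word length) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the window size of 2, the feature names, the `BOS`/`EOS` boundary markers, and the function names `token_features` and `sentence_features` are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int, window: int = 2) -> Dict[str, object]:
    """Build language-independent features for the token at position i.

    Assumed feature set (hypothetical, not taken from the paper):
    the word itself, its length in characters, and the surrounding
    words within `window` positions on each side.
    """
    feats: Dict[str, object] = {
        "word": tokens[i],
        "length": len(tokens[i]),  # word length feature
    }
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(tokens):
            # Neighbouring word inside the sentence, e.g. "word[-1]", "word[+2]"
            feats[f"word[{offset:+d}]"] = tokens[j]
        elif j < 0:
            feats["BOS"] = True  # window reaches before the sentence start
        else:
            feats["EOS"] = True  # window reaches past the sentence end
    return feats


def sentence_features(tokens: List[str], window: int = 2) -> List[Dict[str, object]]:
    """Feature dictionaries for every token of one sentence."""
    return [token_features(tokens, i, window) for i in range(len(tokens))]


if __name__ == "__main__":
    # A short Urdu sentence, already tokenized (illustrative input only).
    sentence = ["یہ", "ایک", "کتاب", "ہے"]
    for feats in sentence_features(sentence):
        print(feats)
```

None of these features rely on capitalization, affix lists, or other language-specific resources, which is the sense in which they are language-independent. A list of such per-token dictionaries, one list per sentence, is the input format commonly accepted by CRF toolkits; pairing it with the gold tag sequences would give the training data for a supervised tagger.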

Published in PeerJ Computer Science

ISSN: 2376-5992 (Online)
Publisher: PeerJ Inc.
Country of publisher: United States
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://peerj.com/computer-science/
