Sense Unveiled: Enhancing Urdu Corpus for Nuanced Word Sense Disambiguation

Sarfraz Bibi; Sohail Asghar; Muhammad Zubair

doi:10.1109/access.2024.3451528

IEEE Access (Jan 2024)

Affiliations

Sarfraz Bibi: ORCiD; Department of Computing, Riphah International University, Islamabad, Pakistan
Sohail Asghar: ORCiD; Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan
Muhammad Zubair: ORCiD; Department of Computing, Riphah International University, Islamabad, Pakistan

DOI: https://doi.org/10.1109/access.2024.3451528
Journal volume & issue: Vol. 12
pp. 126329 – 126343

Abstract

Ambiguity in word meanings is a significant challenge in natural language processing and calls for robust Word Sense Disambiguation (WSD) techniques. WSD research has focused predominantly on widely spoken languages such as English and Spanish, with far less attention given to languages such as Urdu. This paper addresses that gap by examining the existing corpora for WSD in Urdu and presenting an Enhanced Urdu (EU) corpus tailored specifically to WSD tasks. The analysis critically evaluates the limitations of the ULS-WSD-18 corpus and justifies the need for a more comprehensive resource. The EU corpus is curated around 960 words, categorized by their frequency in the corpus as most frequent, moderate, or infrequent. These words serve as the foundation for constructing the sentences used in model training and testing. Several similarity coefficients are used to compare the EU corpus with the ULS-WSD-18 corpus, revealing notable patterns in word occurrences, sense structures, and sentence compositions. The findings underscore the potential of the EU corpus to advance WSD research for Urdu. By providing a comprehensive resource for model development and evaluation, this work contributes to the broader goal of improving language processing tools for Urdu and other underrepresented languages.
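
The abstract does not name the similarity coefficients used. As an illustration only, the sketch below computes three common set-based coefficients (Jaccard, Sørensen-Dice, and overlap) between the vocabularies of two corpora. The function names and the toy vocabularies are hypothetical and are not taken from the paper; the actual coefficients and the units compared (words, senses, or sentences) may differ.

```python
# Illustrative sketch: set-based similarity between two corpus vocabularies.
# The choice of coefficients is an assumption; the abstract does not list them.

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dice(a: set, b: set) -> float:
    """Twice the intersection size divided by the sum of both set sizes."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

def overlap(a: set, b: set) -> float:
    """Intersection size divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0

# Hypothetical toy vocabularies standing in for the two corpora.
corpus_a = {"w1", "w2", "w3", "w4"}
corpus_b = {"w3", "w4", "w5"}

for name, fn in [("jaccard", jaccard), ("dice", dice), ("overlap", overlap)]:
    print(name, round(fn(corpus_a, corpus_b), 4))
```

All three coefficients range from 0 (no shared items) to 1. The overlap coefficient reaches 1 whenever one vocabulary is a subset of the other, so it is less sensitive to a size difference between the two corpora than Jaccard or Dice.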

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
