IEEE Access (Jan 2022)

Toward the Development of Large-Scale Word Embedding for Low-Resourced Language

  • Shahzad Nazir
  • Muhammad Asif
  • Shahbaz Ahmad Sahi
  • Shahbaz Ahmad
  • Yazeed Yasin Ghadi
  • Muhammad Haris Aziz

DOI
https://doi.org/10.1109/ACCESS.2022.3173259
Journal volume & issue
Vol. 10
pp. 54091–54097

Abstract

Word embedding is a key procedure in natural language processing (NLP) for semantically and syntactically manipulating an unlabeled text corpus. The process represents the extracted features of the corpus in a vector space, enabling NLP tasks such as summary generation, text simplification, and next-sentence prediction. Several word embedding approaches exploit co-occurrence and word frequency, including matrix factorization, skip-gram, hierarchical-structure regularizers, and noise contrastive estimation. These approaches have produced mature word vectors for the world's most widely spoken languages; the Urdu language, however, with 231.3 million speakers, has received comparatively little attention from the research community. This paper focuses on creating Urdu word embeddings. For this task, we used a dataset covering news categories such as business, sports, health, politics, entertainment, science, world affairs, and other topics, which yielded 288 million tokens after tokenization. For word vector formation we utilized the skip-gram architecture of the word2vec model, training embeddings with vector dimensions of 100, 128, 200, 256, 300, 400, 500, and 512. For evaluation, the annotated WordSim-353 and SimLex-999 datasets were used. The proposed work achieved a Spearman correlation coefficient of 0.66 on WordSim-353 and 0.439 on SimLex-999, improving on state-of-the-art results.
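
The training and evaluation pipeline described in the abstract can be illustrated as follows. This is a minimal sketch, assuming gensim's Word2Vec implementation (with sg=1 selecting skip-gram) and SciPy's Spearman correlation; the corpus file name, the benchmark file format, and hyperparameters such as window and min_count are illustrative assumptions not specified in the abstract.

```python
# Minimal sketch, not the authors' exact setup: trains skip-gram word2vec
# models at the dimensions reported in the abstract, then scores one model
# against a WordSim-353-style similarity benchmark with Spearman correlation.
from gensim.models import Word2Vec
from scipy.stats import spearmanr

# Assumed input: one tokenized Urdu news sentence per line, space-separated.
with open("urdu_news_tokenized.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

models = {}
for dim in (100, 128, 200, 256, 300, 400, 500, 512):
    # sg=1 selects the skip-gram architecture; window and min_count are
    # assumed values, not reported in the abstract.
    models[dim] = Word2Vec(sentences, vector_size=dim, sg=1,
                           window=5, min_count=5, workers=4)

def evaluate(model, pairs_path):
    """Spearman correlation between human similarity judgments and cosine
    similarities, for a TSV file of: word1 <tab> word2 <tab> score."""
    human, predicted = [], []
    with open(pairs_path, encoding="utf-8") as f:
        for line in f:
            w1, w2, score = line.strip().split("\t")
            if w1 in model.wv and w2 in model.wv:  # skip out-of-vocabulary pairs
                human.append(float(score))
                predicted.append(model.wv.similarity(w1, w2))
    return spearmanr(human, predicted).correlation

print(evaluate(models[300], "wordsim353_urdu.tsv"))  # hypothetical file name
```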

Keywords