University of Chitral Journal of Linguistics and Literature (Jan 2024)
Developing Lexico-Semantic Relations of Saraiki Nouns: A Corpus-Based Study
Abstract
Saraiki, being the fourth most widely spoken language in Pakistan and being used in some parts of India and Afghanistan, is of significant geographical, historical, and cultural importance. However, it remains neglected in terms of proper documentation and identification of its unique linguistic features. The current study is centered on identifying the lexico-semantic categories of Saraiki nouns and then developing their hierarchical relationships (Miller et al., 1993). This quantitative research is designed to contribute to the process of developing Saraiki WordNet and is related to Natural Language Processing (NLP). A corpus of 3 million words was developed on the basis of data collected from different genres of the Saraiki language, including newspapers, academic essays, literary texts, and religious books. Both expansion and merge approaches were used to analyze the data. A wordlist of 1500 most occurring nouns was extracted from the corpus using Antconc 3.4.4.0, followed by manual tagging in Microsoft Excel 2010. Resultantly, 39 most occurring nouns from the wordlist were used to develop 173 related synsets, and lexico-semantic relationships among these nouns were identified with the help of 30 hierarchies (Miller et al., 1993). This study is limited to areas like Bahawalpur, Multan, and Muzaffarabad. It would be a milestone for Saraiki language learners, SWN development, Saraiki lexical resources, online SL dictionaries, and a guide for researchers.