Journal of Informatics and Web Engineering (Jun 2024)
Knowledge-based Word Tokenization System for Urdu
Abstract
Word tokenization is a foundational step in natural language processing (NLP): it underpins tasks such as part-of-speech tagging, named entity recognition, and parsing, as well as various independent NLP applications. The exponential growth of textual data on the World Wide Web demands sophisticated tools for effective processing. Urdu, spoken widely across the globe, is experiencing a surge in speakers, emphasizing the urgent need for advanced Urdu Language Processing (ULP) systems. However, Urdu, labeled as a low-resource language, presents unique challenges due to its distinct writing style, the absence of capitalization, and the prevalence of compound words. This study introduces a novel knowledge-based word tokenization system tailored for Urdu. Central to this system is a maximum matching model with forward and reverse variants, setting it apart from conventional approaches. The novelty of our system lies in its holistic approach, which integrates knowledge-based techniques, dual-variant maximum matching, and heightened adaptability to low-resource language challenges compared to traditional machine learning (ML) approaches. Significantly, our system eliminates the need for a features file and pre-labelled datasets, streamlining the tokenization process. To evaluate the proposed model's efficacy, an analysis was conducted on a dataset comprising 100 sentences with 5,000 Urdu words, yielding an accuracy of 97%. This research contributes to Urdu language processing by offering a solution to the tokenization difficulties posed by Urdu's unique linguistic attributes.
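As an illustration of the dual-variant maximum matching idea named in the abstract, the following is a minimal sketch, not the authors' implementation. The function names, the single-character fallback for unknown text, and the toy romanized lexicon are all assumptions made for the example; the actual system's lexicon, tie-breaking rules, and handling of Urdu script are not specified here.

```python
def forward_max_match(text, lexicon):
    """Greedy left-to-right segmentation: at each position take the
    longest lexicon entry that matches (hypothetical helper)."""
    max_len = max(map(len, lexicon))  # longest entry bounds the search window
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            # Assumed fallback: emit a single character if nothing matches.
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


def reverse_max_match(text, lexicon):
    """Greedy right-to-left segmentation: the same idea, scanning from
    the end of the string (hypothetical helper)."""
    max_len = max(map(len, lexicon))
    tokens, j = [], len(text)
    while j > 0:
        # Earliest start index gives the longest candidate ending at j.
        for i in range(max(0, j - max_len), j):
            if text[i:j] in lexicon or i == j - 1:
                tokens.append(text[i:j])
                j = i
                break
    tokens.reverse()  # tokens were collected from the end backwards
    return tokens


# Toy lexicon (an assumption for illustration, not data from the study).
LEXICON = {"the", "them", "men", "en"}

print(forward_max_match("themen", LEXICON))  # ['them', 'en']
print(reverse_max_match("themen", LEXICON))  # ['the', 'men']
```

The two variants can disagree on the same input, as in the toy case above, which is one reason a system might run both and compare or reconcile the results.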
Keywords