Computers (Aug 2024)
Investigating the Challenges and Opportunities in Persian Language Information Retrieval through Standardized Data Collections and Deep Learning
Abstract
The Persian language, also known as Farsi, is distinguished by its intricate morphological richness, yet it contends with a paucity of linguistic resources. With an estimated 110 million speakers, it finds prevalence across Iran, Tajikistan, Uzbekistan, Iraq, Russia, Azerbaijan, and Afghanistan. However, despite its widespread usage, scholarly investigations into Persian document retrieval remain notably scarce. This circumstance is primarily attributed to the absence of standardized test collections, which impedes the advancement of comprehensive research endeavors within this realm. As data corpora are the foundation of natural language processing applications, this work aims at Persian language datasets to address their availability and structure. Subsequently, we motivate a learning-based framework for the processing of Persian texts and their recognition, for which current state-of-the-art approaches from deep learning, such as deep neural networks, are further discussed. Our investigations highlight the challenges of realizing such a system while emphasizing its possible benefits for an otherwise rarely covered language.
Keywords