Investigating the Challenges and Opportunities in Persian Language Information Retrieval through Standardized Data Collections and Deep Learning

Sara Moniri; Tobias Schlosser; Danny Kowerko

doi:10.3390/computers13080212

Computers (Aug 2024)

Affiliations

Sara Moniri: Junior Professorship of Media Computing, Chemnitz University of Technology, 09107 Chemnitz, Germany
Tobias Schlosser: Junior Professorship of Media Computing, Chemnitz University of Technology, 09107 Chemnitz, Germany
Danny Kowerko: Junior Professorship of Media Computing, Chemnitz University of Technology, 09107 Chemnitz, Germany

Journal volume & issue: Vol. 13, no. 8, p. 212

Abstract

The Persian language, also known as Farsi, is morphologically rich, yet it suffers from a scarcity of linguistic resources. With an estimated 110 million speakers, it is spoken across Iran, Tajikistan, Uzbekistan, Iraq, Russia, Azerbaijan, and Afghanistan. Despite this widespread usage, research on Persian document retrieval remains scarce. This is primarily attributed to the absence of standardized test collections, which impedes comprehensive research in this area. As data corpora are the foundation of natural language processing applications, this work examines Persian language datasets with respect to their availability and structure. Subsequently, we motivate a learning-based framework for the processing and recognition of Persian texts, for which current state-of-the-art deep learning approaches, such as deep neural networks, are discussed. Our investigations highlight the challenges of realizing such a system while emphasizing its possible benefits for an otherwise rarely covered language.

Published in Computers

ISSN: 2073-431X (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: http://www.mdpi.com/journal/computers
