علوم و فنون مدیریت اطلاعات (Dec 2023)

A Survey of Semantic Search and Retrieval Approaches for Persian and Arabic Texts

  • Ali Mirarab

DOI
https://doi.org/10.22091/stim.2024.10402.2067
Journal volume & issue
Vol. 9, no. 4
pp. 185 – 204

Abstract

Purpose: In recent decades, web search engines have become one of the most prominent and essential tools for accessing information. As the volume of information on the web grows, so does the demand for locating relevant and meaningful information. Traditional search engines typically retrieve results based on keyword matching and the number of similar terms in the texts, which often produces irrelevant results. These problems are even more pronounced in Persian and Arabic because the complex grammar of these languages is difficult for machines to process. The aim of this research is to review and present solutions for semantic search and retrieval of Persian and Arabic texts.

Method: This research is a content analysis study, and data were collected using the library method. The sources consulted included scientific articles, books, theses, and reports. For collecting Persian articles, sources, and for collecting English articles, sources with publication dates from 2020 onwards were used. The collected data were analyzed using content analysis: the results of previous studies were reviewed and evaluated alongside the new findings of the research. This evaluation involved identifying the issues and constraints of current semantic search engines and offering suggestions for enhancement.

Findings: Research on semantic search and information retrieval for Persian and Arabic texts employs methods based on semantic analysis and processing of text using pre-trained language models, clustering algorithms such as K-Means, and knowledge resources such as knowledge graphs. The dataset, the way models and algorithms are used, and the method of semantic search and retrieval between words all influence the system's performance and accuracy. The reviewed studies describe a wide range of methods and algorithms for semantic search and retrieval of text, each of which can produce different results. Each of these methods is able to retrieve the meaning of texts, but they differ in search accuracy, and some methods outperform others. The stronger methods achieve their semantic search capabilities through techniques such as topic analysis, neural networks, and vector representations. At the same time, the appropriate method should be chosen based on the nature of the problem and the characteristics of the data, since each problem and dataset may have its own requirements. Selecting the best method and adjusting its parameters is critical for optimal performance.

Conclusion: Each of the presented methods offers its own solutions for the issues and linguistic characteristics of Persian and Arabic. The methods variously utilize pre-trained language models such as BERT, clustering algorithms such as K-Means, and retrieval systems based on knowledge resources such as knowledge graphs. The presented solutions also rely on specific datasets and resources for training and evaluation, and the differences in the datasets and in how the models and algorithms are used and configured are critical. Some methods retrieve information based on meaning and semantic relationships between words, while others use keyword-based and root-based methods. This variation in the search and retrieval method can affect the system's performance and accuracy. The differences in performance and accuracy among the methods are attributed to the varied ways in which models, algorithms, and data sources are utilized.

Keywords