Iranian Journal of Information Processing & Management (Mar 2023)

Optimizing the organization of Persian text documents using clustering technique

  • Elham Yalveh,
  • Yaghoub Norouzi,
  • Ashkan Khatir

DOI
https://doi.org/10.22034/jipm.2023.698613
Journal volume & issue
Vol. 38, no. 3
pp. 981 – 1010

Abstract

The present study aimed to design a method for organizing Persian text documents using the clustering technique. The statistical population was a data set of 2,943 theses and dissertations. Data were collected from a set of scientific research records in Excel format, which included 5,000 items. After the data were converted into a structured format, they were prepared with preprocessing operations. In the processing stage, the clustering technique was used to develop the proposed algorithm for organizing Persian text documents; the algorithm was introduced as an improvement on the K-means algorithm for document clustering. The evaluation based on external criteria showed that the proposed algorithm improved the clustering quality of documents compared with the K-means and K-means++ algorithms: the research items of each designated category were uniformly distributed in the related subject cluster, which achieved the purpose of the study. In the category/cluster tables obtained from K-means and K-means++, by contrast, research items were distributed non-uniformly across clusters, so the evaluation based on internal criteria was affected by differing cluster densities and inter-cluster similarity. The proposed solutions for selecting the final data set and for the research process also did not affect the size of the data set, so the proposed algorithm works well with high-dimensional features.
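As an illustration of the kind of external evaluation described above, the following minimal sketch builds a category/cluster table and computes purity from it. Purity is assumed here as one common external criterion; the abstract does not name the criteria the authors used, and this is not their proposed algorithm. The function names and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def contingency_table(categories, clusters):
    """Count documents for every (cluster, category) pair."""
    table = defaultdict(Counter)
    for category, cluster in zip(categories, clusters):
        table[cluster][category] += 1
    return table


def purity(categories, clusters):
    """Share of documents that belong to the majority category of their cluster.

    Purity is an assumed example of an external criterion: it compares
    cluster assignments against known category labels.
    """
    if len(categories) != len(clusters):
        raise ValueError("categories and clusters must have the same length")
    if not categories:
        raise ValueError("at least one document is required")
    table = contingency_table(categories, clusters)
    majority_total = sum(max(counts.values()) for counts in table.values())
    return majority_total / len(categories)


# Toy labels invented for illustration; they are not data from the study.
categories = ["law", "law", "law", "medicine", "medicine", "medicine"]
clean_clusters = [0, 0, 0, 1, 1, 1]  # each category sits in its own cluster
mixed_clusters = [0, 0, 1, 1, 0, 1]  # categories are split across clusters

print(purity(categories, clean_clusters))
print(purity(categories, mixed_clusters))
```

In this toy case the clean assignment scores higher than the mixed one, which is the pattern a category/cluster table makes visible: the more each category concentrates in a single cluster, the higher the score.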

Keywords