Urdu Documents Clustering with Unsupervised and Semi-Supervised Probabilistic Topic Modeling

Mubashar Mustafa; Feng Zeng; Hussain Ghulam; Hafiz Muhammad Arslan

doi:10.3390/info11110518

Information (Nov 2020)

Affiliations

Mubashar Mustafa: School of Computer Science and Engineering, Central South University, 410083 Changsha, China
Feng Zeng: School of Computer Science and Engineering, Central South University, 410083 Changsha, China
Hussain Ghulam: School of Computer Science and Engineering, Central South University, 410083 Changsha, China
Hafiz Muhammad Arslan: School of Software Engineering, Northeastern University, 110819 Shenyang, China

DOI: https://doi.org/10.3390/info11110518
Journal volume & issue: Vol. 11, no. 11, p. 518

Abstract

Document clustering groups documents according to certain semantic features. Topic models have a rich semantic structure and considerable potential for helping users understand document corpora. Unfortunately, because these models are purely unsupervised, this potential is stymied on text documents whose categories overlap. To address this problem, some semi-supervised models have been proposed for English. However, no such work is available for Urdu, a low-resource language. Document clustering therefore remains a challenging task in Urdu, which has its own morphology, syntax and semantics. In this study, we propose a semi-supervised framework for Urdu document clustering that deals with the challenges of Urdu morphology. The proposed model combines pre-processing techniques, the seeded-LDA model and Gibbs sampling; we name it seeded-Urdu Latent Dirichlet Allocation (Seeded-ULDA). We apply the proposed model and other methods to Urdu news datasets for categorization. Two conditions are considered for document clustering: a “dataset without overlapping”, in which all classes are distinct, and a “dataset with overlapping”, in which the categories overlap and the classes are connected to each other. The aim of this study is threefold. First, it shows that unsupervised models (Latent Dirichlet Allocation (LDA), non-negative matrix factorization (NMF) and K-means) give satisfactory results on the dataset without overlapping. Second, it shows that these unsupervised models do not perform well on the dataset with overlapping because, on this dataset, they find some topics that are neither entirely meaningful nor effective in extrinsic tasks. Third, the proposed semi-supervised model, Seeded-ULDA, performs well on both datasets because it offers a straightforward and effective way to instruct topic models to find topics of specific interest. The paper shows that Seeded-ULDA achieves significant results compared with the unsupervised algorithms.
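
To make the idea of seeding a topic model concrete, the following is a minimal sketch of a seeded LDA trained with collapsed Gibbs sampling. It is not the authors' Seeded-ULDA implementation: the abstract does not give the model's details, so the seeding scheme used here (an asymmetric topic-word prior that boosts seed words in their topic) is an assumption, as are the function name `seeded_lda`, the parameters `alpha`, `beta` and `boost`, and the toy romanized tokens, which stand in for pre-processed Urdu text.

```python
import random
from collections import defaultdict

def seeded_lda(docs, seeds, n_iter=200, alpha=0.1, beta=0.01, boost=5.0, rng_seed=0):
    """Cluster docs (lists of tokens) into len(seeds) topics.

    seeds is a list with one set of seed words per topic. All names and
    default values here are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(rng_seed)
    K = len(seeds)
    vocab = sorted({w for doc in docs for w in doc})

    # Assumed seeding scheme: seed words get extra prior mass in their topic.
    def prior(k, w):
        return beta + (boost if w in seeds[k] else 0.0)

    prior_sum = [sum(prior(k, w) for w in vocab) for k in range(K)]

    n_dk = [[0] * K for _ in docs]                # document-topic counts
    n_kw = [defaultdict(int) for _ in range(K)]   # topic-word counts
    n_k = [0] * K                                 # tokens assigned to each topic
    z = []                                        # topic assignment of every token

    # Initialisation: seed words start in a seeded topic, other words at random.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            seeded = [k for k in range(K) if w in seeds[k]]
            k = rng.choice(seeded) if seeded else rng.randrange(K)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    # Collapsed Gibbs sampling: resample each token's topic given all the others.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + prior(t, w)) / (n_k[t] + prior_sum[t])
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Each document is assigned to its most frequent topic.
    return [max(range(K), key=lambda t: n_dk[d][t]) for d in range(len(docs))]

# Toy, made-up corpus of romanized tokens (illustrative only).
docs = [
    ["cricket", "team", "match", "khiladi"],
    ["hukumat", "wazir", "election", "party"],
    ["match", "team", "wazir", "stadium"],
]
seeds = [{"cricket", "match"}, {"hukumat", "election"}]
labels = seeded_lda(docs, seeds)
```

Here `labels` holds one topic index per document. The third toy document deliberately mixes vocabulary from both categories, which is the kind of overlap the abstract describes; raising `boost` in this sketch pulls such documents more strongly toward the topics of their seed words.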

Published in Information

ISSN: 2078-2489 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://www.mdpi.com/journal/information/
