Hybrid topic modeling method based on dirichlet multinomial mixture and fuzzy match algorithm for short text clustering

Mutasem K. Alsmadi; Malek Alzaqebah; Sana Jawarneh; Ibrahim ALmarashdeh; Mohammed Azmi Al-Betar; Maram Alwohaibi; Noha A. Al-Mulla; Eman AE Ahmed; Ahmad AL Smadi

doi:10.1186/s40537-024-00930-9

Journal of Big Data (May 2024)

Affiliations

Mutasem K. Alsmadi: Department of MIS, College of Applied Studies and Community Service, Imam Abdulrahman Bin Faisal University
Malek Alzaqebah: Department of Mathematics, College of Science, Imam Abdulrahman Bin Faisal University
Sana Jawarneh: Computer Science Department, Community College Dammam, Imam Abdulrahman Bin Faisal University
Ibrahim ALmarashdeh: Department of MIS, College of Applied Studies and Community Service, Imam Abdulrahman Bin Faisal University
Mohammed Azmi Al-Betar: Artificial Intelligence Research Center (AIRC), College of Engineering and Information Technology, Ajman University
Maram Alwohaibi: Department of Mathematics, College of Science, Imam Abdulrahman Bin Faisal University
Noha A. Al-Mulla: Department of Mathematics, College of Science, Imam Abdulrahman Bin Faisal University
Eman AE Ahmed: Department of Mathematics, College of Science, Imam Abdulrahman Bin Faisal University
Ahmad AL Smadi: Department of Data Science and Artificial Intelligence, Zarqa University

DOI: https://doi.org/10.1186/s40537-024-00930-9
Journal volume & issue: Vol. 11, no. 1, pp. 1–21

Abstract

Topic modeling methods have proved effective for inferring latent topics from short texts. Working with short texts is helpful for many real-world applications, yet challenging because of the sparse terms in the text and the high-dimensional representation. Most topic modeling methods require the number of topics to be defined in advance. Similarly, methods based on the Dirichlet Multinomial Mixture (DMM) require the maximum possible number of topics to be set before execution, which is hard to determine because of topic uncertainty and the large amount of noise in the dataset. Hence, this paper introduces a new approach called the Topic Clustering algorithm based on Levenshtein Distance (TCLD). TCLD combines DMM models and the fuzzy matching algorithm to address two key challenges in topic modeling: (a) the outlier problem in topic modeling methods and (b) the problem of determining the optimal number of topics. TCLD takes the initial clustered topics generated by DMM models and then evaluates the semantic relationships between documents using Levenshtein distance. It then decides whether to keep each document in the same cluster, relocate it to another cluster, or mark it as an outlier. The results demonstrate the efficiency of the proposed approach across six English benchmark datasets in comparison with seven topic modeling approaches, with an 83% improvement in purity and a 67% enhancement in Normalized Mutual Information (NMI) across all datasets. The proposed method was also applied to a collected set of Arabic tweets, and the results showed that only 12% of the Arabic short texts were incorrectly clustered, according to human inspection.
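The keep, relocate, or mark-as-outlier decision described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the decision rule, so the nearest-neighbour rule, the normalised similarity, the `threshold` value, and the names `levenshtein`, `similarity`, and `refine` are all assumptions made for illustration. The initial labels are hand-written stand-ins for the output of a DMM model.

```python
from typing import List, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # drop ch_a
                               current[j - 1] + 1,       # drop ch_b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def refine(docs: List[str], labels: List[int],
           threshold: float = 0.5) -> List[Optional[int]]:
    """Reassign each document to the cluster holding its most similar other
    document; return None (outlier) when that similarity is below threshold.
    Both the rule and the threshold are illustrative assumptions."""
    refined: List[Optional[int]] = []
    for i, doc in enumerate(docs):
        best_cluster, best_score = None, -1.0
        for cluster in sorted(set(labels)):
            scores = [similarity(doc, other)
                      for j, other in enumerate(docs)
                      if j != i and labels[j] == cluster]
            if scores and max(scores) > best_score:
                best_cluster, best_score = cluster, max(scores)
        refined.append(best_cluster if best_score >= threshold else None)
    return refined


# Hand-written stand-in for initial DMM cluster labels (hypothetical data).
docs = ["cheap flights to paris", "cheap flight to paris", "cheap flights paris",
        "stock market crash", "stock markets crash", "zzzz qqqq"]
initial_labels = [0, 0, 1, 1, 1, 0]
print(refine(docs, initial_labels))
```

In this toy input the third document starts in the wrong cluster and is relocated, while the last one resembles nothing else and is flagged as an outlier. Note that plain Levenshtein distance measures character-level closeness only, so any semantic judgement would depend on how the paper applies it.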

Published in Journal of Big Data

ISSN: 2196-1115 (Online)
Publisher: SpringerOpen
Country of publisher: United Kingdom
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware; Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://journalofbigdata.springeropen.com
