Short Text Clustering Algorithms, Application and Challenges: A Survey

Majid Hameed Ahmed; Sabrina Tiun; Nazlia Omar; Nor Samsiah Sani

doi:10.3390/app13010342

Applied Sciences (Dec 2022)

Short Text Clustering Algorithms, Application and Challenges: A Survey

Majid Hameed Ahmed,
Sabrina Tiun,
Nazlia Omar,
Nor Samsiah Sani

Affiliations

Majid Hameed Ahmed: CAIT, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi 43600, Selangor, Malaysia
Sabrina Tiun: CAIT, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi 43600, Selangor, Malaysia
Nazlia Omar: CAIT, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi 43600, Selangor, Malaysia
Nor Samsiah Sani: CAIT, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi 43600, Selangor, Malaysia

DOI: https://doi.org/10.3390/app13010342
Journal volume & issue: Vol. 13, no. 1
p. 342

Abstract

Read online

The number of online documents has rapidly grown, and with the expansion of the Web, document analysis, or text analysis, has become an essential task for preparing, storing, visualizing and mining documents. The texts generated daily on social media platforms such as Twitter, Instagram and Facebook are vast and unstructured. Most of these generated texts come in the form of short text and need special analysis because short text suffers from lack of information and sparsity. Thus, this topic has attracted growing attention from researchers in the data storing and processing community for knowledge discovery. Short text clustering (STC) has become a critical task for automatically grouping various unlabelled texts into meaningful clusters. STC is a necessary step in many applications, including Twitter personalization, sentiment analysis, spam filtering, customer reviews and many other social network-related applications. In the last few years, the natural-language-processing research community has concentrated on STC and attempted to overcome the problems of sparseness, dimensionality, and lack of information. We comprehensively review various STC approaches proposed in the literature. Providing insights into the technological component should assist researchers in identifying the possibilities and challenges facing STC. To gain such insights, we review various literature, journals, and academic papers focusing on STC techniques. The contents of this study are prepared by reviewing, analysing and summarizing diverse types of journals and scholarly articles with a focus on the STC techniques from five authoritative databases: IEEE Xplore, Web of Science, Science Direct, Scopus and Google Scholar. This study focuses on STC techniques: text clustering, challenges to short texts, pre-processing, document representation, dimensionality reduction, similarity measurement of short text and evaluation.

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci

About the journal

Abstract

Keywords