FOCT: Fast Overlapping Clustering for Textual Data

Atefeh Khazaei; Hamidreza Khaleghzadeh; Mohammad Ghasemzadeh

doi:10.1109/ACCESS.2021.3130094

IEEE Access (Jan 2021)

Affiliations

Atefeh Khazaei: School of Computer Science, University College Dublin, Dublin 4, Ireland
Hamidreza Khaleghzadeh: School of Computing, University of Portsmouth, Portsmouth, U.K.
Mohammad Ghasemzadeh: Computer Group, Engineering Campus, Yazd University, Yazd, Iran

DOI: https://doi.org/10.1109/ACCESS.2021.3130094
Journal volume & issue: Vol. 9
pp. 157670 – 157680

Abstract

Text clustering is used to extract specific information from textual data and to categorize text by topic and sentiment. Because textual documents inherently overlap, overlapping clustering algorithms are a suitable approach for text analysis. However, state-of-the-art algorithms are not fast enough to analyse large volumes of textual data within tolerable time limits. In this research, we propose a text clustering algorithm, FOCT, a fast overlapping extension of SOM, one of the best algorithms for clustering textual data. We apply heuristics that exploit special characteristics of textual data to establish a very fast overlapping clustering algorithm. We use fast methods to represent document vectors, compute the similarity between documents and neurons, and update the neuron weights. In our algorithm, each document can belong to one or more neurons, which is in line with the nature of many documents. We compare the efficiency of the proposed algorithm with the k-means, OKM, SOM and OSOM clustering approaches and experimentally demonstrate that it runs 12 to 690 times faster, and that the overlap size of FOCT clusters is closer to the overlap size of the original data. Cluster quality is also measured by four different internal and external evaluation criteria, on which FOCT clusters show up to 64% better quality.
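
The general idea of overlapping assignment in a SOM-style clusterer can be illustrated with a minimal sketch. This is not the FOCT algorithm itself: the abstract does not give its heuristics, so the sparse dictionary representation, the cosine similarity, the `ratio` threshold rule, the learning rate `lr`, and all function names below are assumptions chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the smaller dict so the dot product cost follows the shorter vector.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def overlapping_assign(doc, neurons, ratio=0.8):
    """Return indices of every neuron whose similarity is at least ratio * best.

    Assumed rule for illustration: a document joins all neurons that are
    nearly as similar to it as its best-matching neuron.
    """
    sims = [cosine(doc, n) for n in neurons]
    best = max(sims)
    if best <= 0.0:
        return []  # the document shares no terms with any neuron
    return [i for i, s in enumerate(sims) if s >= ratio * best]


def update(neuron, doc, lr=0.1):
    """Standard SOM-style step: move the neuron toward the document, w += lr * (x - w)."""
    for t in set(neuron) | set(doc):
        w = neuron.get(t, 0.0)
        neuron[t] = w + lr * (doc.get(t, 0.0) - w)


# Hypothetical toy data: three neurons and one document that spans two topics.
neurons = [
    {"goal": 1.0, "match": 1.0},
    {"vote": 1.0, "election": 1.0},
    {"stock": 1.0},
]
doc = {"match": 1.0, "vote": 1.0}

winners = overlapping_assign(doc, neurons)
for i in winners:
    update(neurons[i], doc, lr=0.5)
```

In this toy example the document is equally similar to the first two neurons, so it is assigned to both and both are updated, whereas a non-overlapping SOM would pick a single winner.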

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
