Short Text Document Clustering using Distributed Word Representation and Document Distance

Supavit KONGWUDHIKUNAKORN; Kitsana WAIYAMAI

doi:10.14456/vol16iss1pp%p

Walailak Journal of Science and Technology (Mar 2018)

Affiliations

Supavit KONGWUDHIKUNAKORN: Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok 10900
Kitsana WAIYAMAI: Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok 10900

DOI: https://doi.org/10.14456/vol16iss1pp%p
Journal volume & issue: Vol. 16, no. 2

Abstract

This paper presents a method for clustering short text documents, such as instant messages, SMS, or news headlines. The vocabulary of each text is expanded using external knowledge sources and represented with distributed word representations. Clustering is performed with the K-means algorithm, using Word Mover's Distance as the distance metric. Experiments compared the clustering quality of this method with that of several leading methods on large datasets drawn from BBC headlines, SearchSnippets, StackExchange, and Twitter. For all datasets, the proposed algorithm produced document clusters with higher accuracy, precision, F1-score, and Adjusted Rand Index. We also observe that a description of each cluster can be inferred from the keywords represented in that cluster.
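
The general idea of clustering short texts by a word-embedding-based document distance can be illustrated with a small sketch. This is not the authors' implementation, and it rests on several assumptions. The 2-D vectors in EMBEDDINGS are hypothetical toy values standing in for pretrained word embeddings. The exact Word Mover's Distance is replaced by the relaxed Word Mover's Distance, a cheaper lower bound. The paper's K-means procedure is replaced by a k-medoids-style loop, since a medoid needs only pairwise distances. The vocabulary expansion step is omitted, and all function names are illustrative.

```python
import math
from collections import Counter

# Hypothetical 2-D word vectors, for illustration only. A real system would
# load pretrained embeddings (for example word2vec or GloVe).
EMBEDDINGS = {
    "football": (0.90, 0.10), "soccer": (0.85, 0.15),
    "goal": (0.80, 0.20), "match": (0.75, 0.10),
    "stock": (0.10, 0.90), "market": (0.15, 0.85),
    "shares": (0.10, 0.80), "bank": (0.20, 0.90),
}


def nbow(text):
    """Normalized bag-of-words weights over in-vocabulary tokens."""
    counts = Counter(t for t in text.lower().split() if t in EMBEDDINGS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("document has no in-vocabulary words")
    return {word: count / total for word, count in counts.items()}


def one_sided(a, b):
    """Every word in `a` moves all of its weight to its nearest word in `b`."""
    return sum(
        weight * min(math.dist(EMBEDDINGS[x], EMBEDDINGS[y]) for y in b)
        for x, weight in a.items()
    )


def relaxed_wmd(a, b):
    """Relaxed Word Mover's Distance: a lower bound on the exact WMD."""
    return max(one_sided(a, b), one_sided(b, a))


def cluster(docs, k, iters=10):
    """k-medoids-style clustering over a precomputed distance matrix."""
    bows = [nbow(d) for d in docs]
    n = len(bows)
    dist = [[relaxed_wmd(bows[i], bows[j]) for j in range(n)] for i in range(n)]

    # Deterministic farthest-first initialization, starting from document 0.
    medoids = [0]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))

    labels = []
    for _ in range(iters):
        # Assignment step: each document joins its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if members:
                new_medoids.append(min(members, key=lambda i: sum(dist[i][j] for j in members)))
            else:
                new_medoids.append(medoids[c])  # keep the old medoid if a cluster empties
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


docs = ["football match goal", "stock market shares", "soccer goal", "bank shares market"]
labels = cluster(docs, k=2)
print(labels)
```

With these toy vectors, the two sports headlines fall into one cluster and the two finance headlines into the other. On real data, a library implementation of the exact Word Mover's Distance and properly trained embeddings would replace the simplifications made here.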

Published in Walailak Journal of Science and Technology

ISSN: 1686-3933 (Print); 2228-835X (Online)
Publisher: Walailak University
Country of publisher: Thailand
LCC subjects: Technology: Technology (General); Science: Science (General)
Website: http://wjst.wu.ac.th/index.php/wjst/index
