A Joint Semantic Vector Representation Model for Text Clustering and Classification

S. Momtazi; A. Rahbar; D. Salami; I. Khanijazani

doi:10.22044/jadm.2019.7400.1876

Journal of Artificial Intelligence and Data Mining (Jul 2019)

Affiliations

S. Momtazi: Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran.
A. Rahbar: Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran.
D. Salami: Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran.
I. Khanijazani: Computer Engineering and Information Technology Department, Amirkabir University of Technology, Tehran, Iran.

DOI: https://doi.org/10.22044/jadm.2019.7400.1876
Journal volume & issue: Vol. 7, no. 3
pp. 443 – 450

Abstract

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their limited ability to capture the semantic concepts of a text has motivated researchers to use semantic models for document vector representation. Latent Dirichlet allocation (LDA) topic modeling and doc2vec neural document embedding are two well-known techniques for this purpose. In this paper, we first study the conceptual difference between the two models and show that they behave differently and capture the semantic features of texts from different perspectives. We then propose a hybrid approach to document vector representation that benefits from the advantages of both models. Experimental results on the 20 Newsgroups dataset show the superiority of the proposed model over each of the baselines on both text clustering and text classification. We achieve a 2.6% improvement in F-measure for text clustering and a 2.1% improvement in F-measure for text classification compared to the best baseline model.
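
The abstract does not state how the LDA and doc2vec representations are combined. As a minimal sketch of one plausible reading, the snippet below assumes the joint vector is a simple concatenation of the two per-document vectors, each scaled to unit length so that neither dominates a distance computation. The function names and the toy input values are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length; a zero vector is returned as-is."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [float(x) for x in vec]
    return [x / norm for x in vec]


def joint_vector(topic_dist: Sequence[float],
                 doc_embedding: Sequence[float]) -> List[float]:
    """Build a joint document vector from two precomputed representations.

    Assumption: the hybrid is a concatenation of the normalized LDA topic
    distribution and the normalized doc2vec embedding. The paper may combine
    them differently.
    """
    return l2_normalize(topic_dist) + l2_normalize(doc_embedding)


# Hypothetical inputs for a single document: a 3-topic LDA distribution
# and a 2-dimensional doc2vec embedding (real models use far more dimensions).
lda_topics = [0.7, 0.2, 0.1]
d2v_embedding = [3.0, -4.0]

joint = joint_vector(lda_topics, d2v_embedding)
print(len(joint))  # dimensionality is the sum of the two input sizes
```

Under this assumption, the resulting vectors could be passed to any standard clustering or classification algorithm in place of TF-IDF features.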

Published in Journal of Artificial Intelligence and Data Mining

ISSN: 2322-5211 (Print); 2322-4444 (Online)
Publisher: Shahrood University of Technology
Country of publisher: Iran, Islamic Republic of
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology; Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software
Website: http://jad.shahroodut.ac.ir/
