GAE-Based Document Embedding Method for Clustering

Sungwon Jung; Sangmin Ka

doi:10.1109/ACCESS.2022.3228548

IEEE Access (Jan 2022)

Affiliations

Sungwon Jung: Department of Computer Science and Engineering, Sogang University, Seoul, South Korea
Sangmin Ka: Department of Computer Science and Engineering, Sogang University, Seoul, South Korea

DOI: https://doi.org/10.1109/ACCESS.2022.3228548
Journal volume & issue: Vol. 10
pp. 130089 – 130096

Abstract

Several document embedding methods for clustering based on deep neural networks have been proposed recently. However, these methods either generate document embeddings that depend on a given number of document clusters, or generate embeddings that ignore the high similarity between documents belonging to the same cluster. In this paper, we propose a new document embedding method for clustering that uses a graph autoencoder (GAE). To this end, we construct an undirected, weighted, sparse graph from a set of documents, in which each document is represented by a node and every weighted edge connects two nodes whose cosine similarity is high. We then apply the proposed graph autoencoder to this graph to compute node embedding vectors, and each node embedding vector is used as the embedding vector of the corresponding document. This paper presents in-depth experimental analyses of the proposed method. Experimental results on various real document data sets demonstrate that the proposed approach affords a significant performance improvement over existing document embedding methods.
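
The graph construction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how a "high" cosine similarity is determined, so the fixed `threshold` parameter, the function name `build_similarity_graph`, and the toy term-count vectors are all assumptions made for illustration. The resulting adjacency matrix is the kind of input a graph autoencoder would then consume.

```python
import numpy as np


def build_similarity_graph(doc_vectors, threshold=0.5):
    """Build a symmetric weighted adjacency matrix from document vectors.

    An edge (i, j) is kept only when the cosine similarity of documents
    i and j is at least `threshold` (an assumed cutoff, not taken from
    the paper). The edge weight is the cosine similarity itself.
    """
    X = np.asarray(doc_vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0          # guard against all-zero documents
    unit = X / norms                   # rows scaled to unit length
    sim = unit @ unit.T                # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)         # no self-loops
    return np.where(sim >= threshold, sim, 0.0)


# Hypothetical term-count vectors for four documents over four terms.
docs = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
])
adjacency = build_similarity_graph(docs, threshold=0.5)
```

In this toy input, documents 0 and 1 share vocabulary, as do documents 2 and 3, so only those two pairs end up connected. A top-k neighbor rule would be another plausible way to keep the graph sparse.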

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
