IEEE Access (Jan 2020)
Hierarchical Dirichlet Multinomial Allocation Model for Multi-Source Document Clustering
Abstract
Mining document structure from multiple data sources in terms of their underlying topics has become an important task in document clustering. Traditional document clustering approaches cannot be applied directly to the multi-source document clustering problem, which poses three typical difficulties: 1) The topics of different data sources are related but not the same. 2) Each data source usually has its own topic focus. 3) The numbers of clusters in the data sources are not necessarily the same and are not known beforehand. In this paper, building on our previous research, we design a novel multi-source document clustering model, the hierarchical Dirichlet multinomial allocation (HDMA) model, to address all three problems. The HDMA model uses a two-step hierarchical topic generation process. Topics are learned so that they share their general characteristics across data sources while preserving the local characteristics of each data source. An exclusive topic partition is applied to each data source to learn its source-level topic emphasis. A Gibbs sampling algorithm then simultaneously learns the number of clusters for each data source and the parameters of the HDMA model. Experimental results demonstrate that the HDMA model is effective.
Keywords