IEEE Access (Jan 2021)

MCF Tree-Based Clustering Method for Very Large Mixed-Type Data Set

  • Hyeong-Cheol Ryu,
  • Sungwon Jung

DOI
https://doi.org/10.1109/ACCESS.2021.3118411
Journal volume & issue
Vol. 9
pp. 138580 – 138597

Abstract

Read online

Several clustering methods have been proposed for analyzing numerous mixed-type data sets composed of numeric and categorical attributes. However, existing clustering methods are not suitable for clustering very large mixed-type data sets because they require a high computational cost or a large memory size. We propose a novel clustering method for very large data sets using a mixed-type clustering feature (MCF) vector with summary information about a cluster. The MCF vector consists of the CF vector and a histogram to summarize the mixed-type values. Based on the MCF vector, we propose an MCF tree, along with a distance measure between the MCF vectors representing two clusters. Unlike previous studies that summarize a data set based on a fixed memory size, we estimate a small initial memory size of the data set for building the tree. Then, the memory size is adaptively increased to estimate a more accurate threshold by reflecting the size reduction in the re-built tree. Our theoretical analysis demonstrates the efficiency of the proposed approach. Experimental results on very large synthetic and real data sets demonstrate that the proposed approach clusters the data sets significantly faster than existing clustering methods while maintaining similar or better clustering accuracy.

Keywords