IEEE Access (Jan 2019)

A Hybrid and Parameter-Free Clustering Algorithm for Large Data Sets

  • Hengkang Shao,
  • Ping Zhang,
  • Xinye Chen,
  • Fang Li,
  • Guanglong Du

DOI
https://doi.org/10.1109/ACCESS.2019.2900260
Journal volume & issue
Vol. 7
pp. 24806 – 24818

Abstract

As an important unsupervised learning method, clustering can effectively find hidden structures in data. As the amount of data grows, clustering large data sets becomes a challenging task. Many clustering algorithms have been developed for small data sets, but they are often inefficient on large ones. Moreover, most clustering algorithms require extra input parameters, which may not be easy to obtain in practical applications. This paper proposes a new clustering algorithm called the hybrid and parameter-free clustering method (HPFCM). HPFCM can rapidly cluster large data sets without knowing the number of clusters in advance. It works in five stages: sampling the large data set (MMRS* sampling), assessing the clustering tendency of the sample (eVAT), determining the number of clusters (EPB), forming the partitions (MST tree cutting), and extending the results to the rest of the data set. We compare HPFCM with three other methods that are popular for clustering large data sets. Several numerical and real-world experiments have been conducted to verify our algorithm. The results show the potential and effectiveness of HPFCM for clustering large data sets.
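The sample, partition, and extend structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and it rests on several assumptions: uniform random sampling stands in for MMRS* sampling, the number of clusters k is passed in by the caller where HPFCM would estimate it via eVAT and EPB, and non-sampled points are assigned the label of their nearest sampled point. The function names (mst_edges, cut_mst, cluster_large) are hypothetical.

```python
import math
import random


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (weight, i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known distance from the tree to each point
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def cut_mst(n, edges, k):
    """Drop the k-1 longest MST edges and label the connected components."""
    kept = sorted(edges)[: len(edges) - (k - 1)]
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for _, i, j in kept:
        root[find(i)] = find(j)
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


def cluster_large(data, k, sample_size, seed=0):
    """Cluster a sample by MST cutting, then extend labels to all points.

    Assumptions: uniform sampling (not MMRS*), k given by the caller
    (not estimated by eVAT/EPB), nearest-sample label extension.
    """
    rng = random.Random(seed)
    idx = rng.sample(range(len(data)), sample_size)
    sample = [data[i] for i in idx]
    sample_labels = cut_mst(len(sample), mst_edges(sample), k)
    labels = []
    for p in data:
        nearest = min(range(len(sample)), key=lambda s: math.dist(p, sample[s]))
        labels.append(sample_labels[nearest])
    return labels
```

Only the sample pays the quadratic cost of building the MST; every remaining point costs one pass over the sample, which is what makes this kind of scheme attractive for large data sets.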

Keywords