IEEE Access (Jan 2023)
Finding the Number of Clusters Using a Small Training Sequence
Abstract
In clustering a training sequence (TS), the K-means algorithm searches for empirically optimal representative vectors, i.e., vectors that achieve the empirical minimum distortion, in order to inductively design representative vectors that attain the true optimum for the underlying distribution. In this paper, the convergence rates of the clustering errors are first observed as functions of $\beta^{-\alpha}$, where $\beta$ is the training ratio, which relates the TS size to the number of representative vectors, and $\alpha$ is a non-negative constant. These convergence rates characterize the training performance for a finite TS size. When the TS is relatively small, errors occur in finding the number of clusters. To reduce such errors, a compensation constant $(1-\beta^{-\alpha})^{-1}$ for the empirical errors is devised based on the rate analyses, and a novel algorithm for finding the number of clusters is proposed. The compensation constant can also be applied to other clustering applications, especially when the TS size is relatively small.
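To illustrate the idea, the sketch below applies the compensation constant $(1-\beta^{-\alpha})^{-1}$ to the empirical distortion of a toy 1-D K-means run. This is a minimal illustration, not the paper's algorithm: the training ratio is assumed to be $\beta = n/k$ (TS size over number of representative vectors), $\alpha = 1$ is an arbitrary choice, and the quantile-initialized `kmeans` helper is a hypothetical simplification.

```python
import numpy as np

def kmeans(x, k, iters=50):
    """Minimal 1-D k-means with deterministic quantile initialization
    (illustrative helper, not the paper's procedure)."""
    c = np.quantile(x, (np.arange(k) + 0.5) / k)  # centroids at evenly spaced quantiles
    for _ in range(iters):
        assign = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
        for j in range(k):
            pts = x[assign == j]
            if pts.size:
                c[j] = pts.mean()
    assign = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
    return c, assign

def empirical_distortion(x, c, assign):
    """Mean squared distance from each sample to its representative vector."""
    return np.mean((x - c[assign]) ** 2)

def compensation(n, k, alpha=1.0):
    """Compensation constant (1 - beta^{-alpha})^{-1}, assuming beta = n / k."""
    beta = n / k
    return 1.0 / (1.0 - beta ** (-alpha))

# Toy TS: two well-separated clusters, 40 samples in total.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.1, 20), rng.normal(10.0, 0.1, 20)])
n = x.size

for k in (1, 2, 3):
    c, a = kmeans(x, k)
    d = empirical_distortion(x, c, a)
    print(k, d, compensation(n, k) * d)
```

The compensation factor grows as $k$ increases (since $\beta = n/k$ shrinks toward 1), inflating the empirical distortion more strongly for larger codebooks; this is what counteracts the optimistic bias of the empirical minimum when the TS is small.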
Keywords