Journal of King Saud University: Computer and Information Sciences (Oct 2023)
CGUFS: A clustering-guided unsupervised feature selection algorithm for gene expression data
Abstract
(Aim) Gene expression data is typically high dimensional with a limited number of samples and contain many features that are unrelated to the disease of interest. Existing unsupervised feature selection algorithms primarily focus on the significance of features in maintaining the data structure while not taking into account the redundancy among features. Determining the appropriate number of significant features is another challenge. (Method) In this paper, we propose a clustering-guided unsupervised feature selection (CGUFS) algorithm for gene expression data that addresses these problems. Our proposed algorithm introduces three improvements over existing algorithms. For the problem that existing clustering algorithms require artificially specifying the number of clusters, we propose an adaptive k-value strategy to assign appropriate pseudo-labels to each sample by iteratively updating a change function. For the problem that existing algorithms fail to consider the redundancy among features, we propose a feature grouping strategy to group highly redundant features. For the problem that the existing algorithms cannot filter the redundant features, we propose an adaptive filtering strategy to determine the feature combinations to be retained by calculating the potentially effective features and potentially redundant features of each feature group. (Result) Experimental results show that the average accuracy (ACC) and matthews correlation coefficient (MCC) indexes of the C4.5 classifier on the optimal features selected by the CGUFS algorithm reach 74.37% and 63.84%, respectively, significantly superior to the existing algorithms. (Conclusion) Similarly, the average ACC and MCC indexes of the Adaboost classifier on the optimal features selected by the CGUFS algorithm are significantly superior to the existing algorithms. In addition, statistical experiment results show significant differences between the CGUFS algorithm and the existing algorithms.