IEEE Access (Jan 2021)

An Empirical Study of Several Information Theoretic Based Feature Extraction Methods for Classifying High Dimensional Low Sample Size Data

  • Sheena Leeza Verghese,
  • Iman Yi Liao,
  • Tomas H. Maul,
  • Siang Yew Chong

DOI
https://doi.org/10.1109/ACCESS.2021.3077958
Journal volume & issue
Vol. 9
pp. 69157 – 69172

Abstract

A high dimensional low sample size (HDLSS) dataset typically contains many features but a limited number of samples. Such datasets are commonly found in domains such as microarray analysis and medical imaging. When the sample size is small, the population probability density function (PDF) of an HDLSS dataset may not be well represented, which makes it difficult to apply feature selection or feature extraction methods to HDLSS data classification. In this paper, we explore the possibility of designing feature selection and feature extraction methods for HDLSS data classification by making loose assumptions about the underlying PDF of an HDLSS dataset. Specifically, we propose to leverage Correlation Explanation (CorEx), a recent unsupervised probabilistic graphical model that assumes a (hierarchical) hidden structure for generating subsets of features that are conditionally independent. We benchmark the proposed method against frequently cited information-theoretic feature extraction and feature selection methods, including Conditional Infomax Feature Extraction (CIFE), Maximum Relevance Minimum Redundancy (MRMR), Maximization of Mutual Information (MMI), Infomax Independent Component Analysis (Infomax ICA), and Kernel Entropy Component Analysis (KECA). The HDLSS datasets used in this study are the breast cancer datasets by Gravier et al. and West et al., the colon cancer dataset by Alon et al., the leukemia dataset by Golub et al., and the Gisette dataset used by Guyon et al. Experimental results demonstrate that the proposed method shows some improvement in classification performance over MMI and Infomax ICA and is competitive with MRMR and CIFE.
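
To make the idea of mutual-information-based feature selection concrete, the following is a minimal sketch of a greedy MRMR-style selector on a toy dataset. It is not the implementation evaluated in the paper, which the abstract does not describe. It assumes discrete features, a plug-in (frequency count) estimate of mutual information, and the "relevance minus mean redundancy" scoring variant; the function names `mutual_information` and `mrmr_select` and the toy data are hypothetical. Continuous data such as gene expression values would first need discretization or a different estimator.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def mrmr_select(columns, labels, k):
    """Greedily pick k feature indices by relevance minus mean redundancy.

    Assumption: this is the difference-form MRMR criterion; other variants exist.
    """
    relevance = [mutual_information(col, labels) for col in columns]
    selected = []
    remaining = list(range(len(columns)))

    def score(j):
        if not selected:
            return relevance[j]
        redundancy = sum(
            mutual_information(columns[j], columns[s]) for s in selected
        ) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): 8 samples, 4 binary features stored column-wise.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 1, 1, 1, 1],  # feature 0: strongly related to the label
    [0, 0, 0, 1, 1, 1, 1, 1],  # feature 1: exact duplicate of feature 0
    [0, 0, 1, 0, 1, 1, 0, 1],  # feature 2: weaker, but adds new information
    [0, 1, 0, 1, 0, 1, 0, 1],  # feature 3: unrelated to the label
]

selected = mrmr_select(columns, labels, k=2)
print(selected)  # [0, 2]
```

In this toy case the duplicate feature 1 is skipped in favor of feature 2, because its redundancy with the already selected feature 0 outweighs its relevance. With only a handful of samples, frequency-count estimates like this are noisy, which is the estimation difficulty the abstract raises for HDLSS data.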

Keywords