BMC Bioinformatics (Jun 2006)

Differential prioritization between relevance and redundancy in correlation-based feature selection techniques for multiclass gene expression data

  • Chetty Madhu,
  • Ooi Chia,
  • Teng Shyh

DOI
https://doi.org/10.1186/1471-2105-7-320
Journal volume & issue
Vol. 7, no. 1
p. 320

Abstract

Read online

Abstract Background Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification while improving accuracy at the same time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied on microarray datasets are either rank-based and hence do not take into account correlations between genes, or are wrapper-based, which require high computational cost, and often yield difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by evaluation procedures which are less than meticulous, resulting in overly optimistic estimates of accuracy. Results We present two realistically evaluated correlation-based feature selection techniques which incorporate, in addition to the two existing criteria involved in forming a predictor set (relevance and redundancy), a third criterion called the degree of differential prioritization (DDP). DDP functions as a parameter to strike the balance between relevance and redundancy, providing our techniques with the novel ability to differentially prioritize the optimization of relevance against redundancy (and vice versa). This ability proves useful in producing optimal classification accuracy while using reasonably small predictor set sizes for nine well-known multiclass microarray datasets. Conclusion For multiclass microarray datasets, especially the GCM and NCI60 datasets, DDP enables our filter-based techniques to produce accuracies better than those reported in previous studies which employed similarly realistic evaluation procedures.