sigFeature: Novel Significant Feature Selection Method for Classification of Gene Expression Data Using Support Vector Machine and t Statistic

Pijush Das; Anirban Roychowdhury; Subhadeep Das; Susanta Roychoudhury; Sucheta Tripathy

doi:10.3389/fgene.2020.00247

Frontiers in Genetics (Apr 2020)

Affiliations

Pijush Das: Computational Genomics Lab, Structural Biology and Bioinformatics Division, CSIR-Indian Institute of Chemical Biology, Kolkata, India
Anirban Roychowdhury: Department of Oncogene Regulation, Chittaranjan National Cancer Institute, Kolkata, India
Subhadeep Das: Computational Genomics Lab, Structural Biology and Bioinformatics Division, CSIR-Indian Institute of Chemical Biology, Kolkata, India
Susanta Roychoudhury: Saroj Gupta Cancer Centre and Research Institute, Kolkata, India
Sucheta Tripathy: Computational Genomics Lab, Structural Biology and Bioinformatics Division, CSIR-Indian Institute of Chemical Biology, Kolkata, India
Sucheta Tripathy: Academy of Scientific and Innovative Research, New Delhi, India

Journal volume & issue: Vol. 11

Abstract

Biological data are accumulating at an ever faster rate, but interpreting them remains a problem. Classifying biological data into distinct groups is the first step in understanding them. Classification of data in response to a given treatment is especially important when making present/absent calls for differentially expressed genes. Many feature selection algorithms have been developed, including the support vector machine recursive feature elimination procedure (SVM-RFE) and its variants. SVM-RFE methods are greedy: they attempt to find the best possible combination of features for binary classification, but the features they select may not be biologically significant. To overcome this limitation of SVM-RFE, we propose a novel feature selection algorithm, termed “sigFeature” (https://bioconductor.org/packages/sigFeature/), based on SVM and the t statistic, which discovers differentially significant features while maintaining good classification performance. The “sigFeature” R package is centered around a function called “sigFeature,” which provides automatic selection of features for binary classification. Using six publicly available microarray data sets (downloaded from Gene Expression Omnibus) with different biological attributes, we compared the performance of “sigFeature” with that of three other feature selection algorithms. Even a small number of features selected by “sigFeature” shows higher classification accuracy. For further downstream evaluation of the biological signature, we conducted gene set enrichment analysis with the features (genes) selected by “sigFeature” and compared the results with the outputs of the other algorithms. We observed that “sigFeature” is able to predict the signature of four out of six microarray data sets accurately, whereas the other algorithms predict fewer data set signatures. Thus, “sigFeature” is considerably better than related algorithms at discovering differentially significant features from microarray data sets.
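
To illustrate the t statistic ingredient mentioned in the abstract, the following is a minimal sketch in Python (standard library only) that ranks features by the absolute value of Welch's two-sample t statistic between two classes. It is not the sigFeature algorithm and does not include the SVM component; the function names `welch_t` and `rank_features_by_t`, the choice of Welch's variant, and the toy expression matrix are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic for two lists of values.

    Each list needs at least two values, because the sample
    variance is undefined otherwise.
    """
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    # Guard against zero standard error (constant feature in both groups).
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def rank_features_by_t(X, y):
    """Return feature indices ordered by decreasing |t| between classes 0 and 1.

    X is a list of samples (rows), each a list of feature values;
    y holds the binary class label (0 or 1) of each sample.
    """
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        group0 = [row[j] for row, label in zip(X, y) if label == 0]
        group1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(welch_t(group0, group1)))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


# Hypothetical toy data: 6 samples x 3 features; only feature 0
# differs clearly between the two classes.
X = [
    [1.0, 5.0, 2.0],
    [1.2, 5.1, 2.5],
    [0.9, 4.9, 1.5],
    [3.0, 5.0, 2.1],
    [3.2, 5.2, 2.4],
    [2.9, 4.8, 1.6],
]
y = [0, 0, 0, 1, 1, 1]

ranking = rank_features_by_t(X, y)
print(ranking)
```

In this toy example the feature with the clearest separation between the two classes is ranked first. A filter of this kind scores each feature independently, whereas the method described in the abstract combines the t statistic with an SVM.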

Published in Frontiers in Genetics

ISSN: 1664-8021 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Science: Biology (General): Genetics
Website: http://journal.frontiersin.org/journal/genetics
