Graph Based Feature Selection for Reduction of Dimensionality in Next-Generation RNA Sequencing Datasets

Consolata Gakii; Paul O. Mireji; Richard Rimiru

doi:10.3390/a15010021

Algorithms (Jan 2022)

Affiliations

Consolata Gakii: Department of Computing and Information Technology, University of Embu, P.O. Box 6-60100, Embu 60100, Kenya
Paul O. Mireji: Biotechnology Research Institute, Kenya Agricultural and Livestock Research Organization, P.O. Box 362-00902, Kikuyu 00902, Kenya
Richard Rimiru: School of Computing and Information Technology, Jomo Kenyatta University of Agriculture and Technology, P.O. Box 62000-00200, Nairobi 00200, Kenya

DOI: https://doi.org/10.3390/a15010021
Journal volume & issue: Vol. 15, no. 1, p. 21

Abstract

Analysis of high-dimensional data, with more features (p) than observations (N) (p > N), places significant demands on computational cost and memory. Feature selection can be used to reduce the dimensionality of the data. We used a graph-based approach, principal component analysis (PCA) and recursive feature elimination to select features for classification from two lung cancer RNAseq datasets. The selected features were discretized for association rule mining, where support and lift were used to generate informative rules. Our results show that graph-based feature selection improved the performance of sequential minimal optimization (SMO) and multilayer perceptron (MLP) classifiers in both datasets. In association rule mining, features selected using the graph-based approach outperformed those from the other two feature-selection techniques at a support of 0.5 and lift of 2. The non-redundant rules reflect the inherent relationships between features. Biological features are usually related to functions in living systems, a relationship that cannot be deduced by feature selection and classification alone. Therefore, the graph-based feature-selection approach combined with rule mining is a suitable way of selecting features and finding associations between them in high-dimensional RNAseq data.
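
For readers unfamiliar with the two rule-mining measures named in the abstract, the following is a minimal sketch of how support and lift are conventionally computed for a rule over discretized features. It is not the authors' implementation: the function names, the `geneA`/`geneB` labels and the four toy samples are hypothetical and chosen only so that the rule lands on the thresholds quoted above (support 0.5, lift 2).

```python
# Illustrative sketch only: standard definitions of support and lift,
# applied to hypothetical discretized expression levels.

def support(transactions, itemset):
    """Fraction of samples that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def lift(transactions, antecedent, consequent):
    """support(A and C) / (support(A) * support(C)); 0.0 if undefined."""
    s_a = support(transactions, antecedent)
    s_c = support(transactions, consequent)
    if s_a == 0 or s_c == 0:
        return 0.0
    s_ac = support(transactions, set(antecedent) | set(consequent))
    return s_ac / (s_a * s_c)


# Hypothetical samples: each is a set of "feature=level" items
# obtained after discretizing expression values.
samples = [
    frozenset({"geneA=high", "geneB=high"}),
    frozenset({"geneA=high", "geneB=high"}),
    frozenset({"geneA=low", "geneB=low"}),
    frozenset({"geneA=low", "geneB=low"}),
]

rule_support = support(samples, {"geneA=high", "geneB=high"})
rule_lift = lift(samples, {"geneA=high"}, {"geneB=high"})
print(rule_support, rule_lift)
```

A lift above 1 indicates that the two items co-occur more often than would be expected if they were independent, which is why it is commonly used alongside support to filter out uninformative rules.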

Published in Algorithms

ISSN: 1999-4893 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.mdpi.com/journal/algorithms
