Informatics in Medicine Unlocked (Jan 2023)

Feature selection for classification using WGCNA and Spread Sub-Sample for an imbalanced rheumatoid arthritis RNASEQ data

  • Consolata Gakii,
  • Victoria Mukami,
  • Boaz Too

Journal volume & issue
Vol. 43
p. 101402

Abstract


An imbalanced classification problem occurs when the distribution of samples among classes is uneven or biased. Handling small and imbalanced training datasets poses a notable challenge in machine learning, especially in domains such as bioinformatics and medical research. These challenges can result in biased models that perform poorly on under-represented classes and overemphasize specific features, failing to capture the genuine patterns present in the data. The present study proposes a feature selection approach based on gene connectivity and a class balancing technique for building a machine learning model using imbalanced gene expression data. Rheumatoid arthritis data composed of 28 normal samples and 152 rheumatoid arthritis samples were used to test the proposed model. Through the weighted gene co-expression network analysis (WGCNA) approach, the features were reduced from 27,991 to 601. The reduced features were used to build machine learning classification models, first with imbalanced classes and then with classes balanced using the Spread Sub-Sample technique. According to our findings, two classifiers reported higher accuracy with the imbalanced data than with the balanced data set, an indication that most classifiers are biased when trained on an imbalanced dataset. Logistic regression returned an improved accuracy of 95%. The other two machine learning algorithms used in this study, decision tree and IBK, returned reduced accuracies of 81% and 91%, respectively. In conclusion, feature selection and class balancing approaches are important for reducing model execution time and improving accuracy, especially for RNASeq gene expression data.
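
The class balancing step can be illustrated with a minimal sketch of spread-style undersampling: each class is capped at a fixed multiple of the rarest class size. This is not the authors' implementation. The function name `spread_subsample`, the `max_spread` parameter, and the spread value of 1.0 are assumptions made for illustration, since the abstract does not state the settings used. Only the class sizes (28 and 152) come from the abstract.

```python
import random
from collections import Counter, defaultdict


def spread_subsample(labels, max_spread=1.0, seed=0):
    """Return sorted indices of the samples kept after undersampling.

    Hypothetical helper: every class is capped at
    max_spread * (size of the rarest class); smaller classes are kept whole.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rarest = min(len(indices) for indices in by_class.values())
    cap = int(max_spread * rarest)

    kept = []
    for indices in by_class.values():
        if len(indices) <= cap:
            kept.extend(indices)                    # class already within the cap
        else:
            kept.extend(rng.sample(indices, cap))   # random draw without replacement
    return sorted(kept)


# Class sizes taken from the abstract; the labels themselves are placeholders.
labels = ["normal"] * 28 + ["RA"] * 152
kept = spread_subsample(labels, max_spread=1.0)  # 1.0 is an assumed spread
balanced_counts = Counter(labels[i] for i in kept)
print(balanced_counts)
```

With an assumed spread of 1.0, the majority class is cut to the size of the minority class, so each class contributes 28 samples. A larger spread would retain more majority samples at the cost of leaving some imbalance.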

Keywords