Applied Sciences (Mar 2021)

Machine Learning Model for Lymph Node Metastasis Prediction in Breast Cancer Using Random Forest Algorithm and Mitochondrial Metabolism Hub Genes

  • Byung-Chul Kim,
  • Jingyu Kim,
  • Ilhan Lim,
  • Dong Ho Kim,
  • Sang Moo Lim,
  • Sang-Keun Woo

DOI
https://doi.org/10.3390/app11072897
Journal volume & issue
Vol. 11, no. 7
p. 2897

Abstract

Read online

Breast cancer metastasis can have a fatal outcome, with the prediction of metastasis being critical for establishing effective treatment strategies. RNA-sequencing (RNA-seq) is a good tool for identifying genes that promote and support metastasis development. The hub gene analysis method is a bioinformatics method that can effectively analyze RNA sequencing results. This can be used to specify the set of genes most relevant to the function of the cell involved in metastasis. Herein, a new machine learning model based on RNA-seq data using the random forest algorithm and hub genes to estimate the accuracy of breast cancer metastasis prediction. Single-cell breast cancer samples (56 metastatic and 38 non-metastatic samples) were obtained from the Gene Expression Omnibus database, and the Weighted Gene Correlation Network Analysis package was used for the selection of gene modules and hub genes (function in mitochondrial metabolism). A machine learning prediction model using the hub gene set was devised and its accuracy was evaluated. A prediction model comprising 54-functional-gene modules and the hub gene set (NDUFA9, NDUFB5, and NDUFB3) showed an accuracy of 0.769 ± 0.02, 0.782 ± 0.012, and 0.945 ± 0.016, respectively. The test accuracy of the hub gene set was over 93% and that of the prediction model with random forest and hub genes was over 91%. A breast cancer metastasis dataset from The Cancer Genome Atlas was used for external validation, showing an accuracy of over 91%. The hub gene assay can be used to predict breast cancer metastasis by machine learning.

Keywords