Exploration of Medicine (Jun 2022)

Construction and validation of gastric cancer diagnosis model based on machine learning

  • Fei Kong,
  • Ziqin Yan,
  • Ning Lan,
  • Pinxiu Wang,
  • Shanlin Fan,
  • Wenzhen Yuan

DOI
https://doi.org/10.37349/emed.2022.00094
Journal volume & issue
Vol. 3, no. 3
pp. 300 – 313

Abstract


Aim: To screen differentially expressed genes related to gastric cancer based on The Cancer Genome Atlas (TCGA) database and to construct a gastric cancer diagnosis model by machine learning.
Methods: Transcriptional data, genomic data, and clinical information of gastric cancer tissues and non-gastric cancer tissues were downloaded from the TCGA database, and differentially expressed messenger RNAs (mRNAs) and long non-coding RNAs (lncRNAs) were screened out. Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were performed on the differentially expressed genes, and a protein-protein interaction (PPI) network of these genes was constructed. Core differentially expressed genes were screened with the molecular complex detection (MCODE) plug-in of the Cytoscape software. The differentially expressed lncRNAs were analyzed by univariate Cox regression, followed by lasso regression for further dimension reduction, to obtain the core genes. The core genes were then screened by machine learning to construct the gastric cancer diagnosis model. The performance of the model was verified externally using the Gene Expression Omnibus (GEO) database.
Results: Ten genes were screened as gastric cancer diagnostic model genes: long intergenic non-protein coding RNA 1821 (LINC01821), AL138826.1, AC022164.1, adhesion G protein-coupled receptor D1-antisense RNA 1 (ADGRD1-AS1), cyclin B1 (CCNB1), kinesin family member 11 (KIF11), Aurora kinase B (AURKB), cyclin dependent kinase 1 (CDK1), nucleolar and spindle associated protein 1 (NUSAP1), and TTK protein kinase (TTK). Performance analysis showed that the random forest model had the best comprehensive evaluation, with an accuracy of 92% and an area under the curve (AUC) of 0.9722, making it the more suitable algorithm for building a gastric cancer diagnosis model. External verification of the model on the GSE54129 data set gave an AUC of 0.904, indicating that the gastric cancer diagnosis model had high accuracy.
Conclusions: Machine learning can simplify the bioinformatics analysis process and improve efficiency. The core genes identified in this study are expected to form the basis of a gene chip for the diagnosis of gastric cancer.
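
For readers less familiar with the two metrics reported above, the following is a minimal sketch of how accuracy and AUC can be computed from a classifier's output. It is not the authors' code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and the AUC is computed with the pairwise (rank-based) definition rather than any particular library routine.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive sample
    receives a higher score than a randomly chosen negative sample.
    Tied scores count as half a win."""
    if len(y_true) != len(scores):
        raise ValueError("inputs must be of equal length")
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data (not from the study): 1 = tumor, 0 = non-tumor.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]

# Assumed decision threshold of 0.5 to turn scores into class labels.
preds = [1 if s >= 0.5 else 0 for s in scores]

print(f"accuracy = {accuracy(labels, preds):.4f}")
print(f"AUC      = {roc_auc(labels, scores):.4f}")
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two values are reported separately.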

Keywords