IEEE Access (Jan 2024)

An Integrated Data Analysis Using Bioinformatics and Random Forest to Predict Prognosis of Patients With Squamous Cell Lung Cancer

  • Debora V. C. Lima,
  • Patrick Terrematte,
  • Beatriz Stransky,
  • Adriao D. D. Neto

DOI
https://doi.org/10.1109/ACCESS.2024.3392277
Journal volume & issue
Vol. 12
pp. 59335 – 59345

Abstract

Lung cancer is the leading cause of cancer death worldwide, regardless of gender. Lung Squamous Cell Carcinoma (LUSC) is the second most common type of lung cancer and is characterized by diagnosis at advanced stages, a poor prognosis, and a strong association with smoking. Given the severity of lung cancer, it is essential to understand its molecular mechanisms. In this context, this study uses transcriptomic and clinical data to implement bioinformatics pipelines and machine learning, through random forest models, to predict patients’ overall survival and to obtain a gene signature of LUSC tumor progression. We analyzed clinical and molecular data from the TCGA-LUSC project and performed differential expression analyses (DEA) comparing normal tissues against tumor tissues. Based on the DEA-selected genes, the patients were divided into three clusters, followed by feature selection and classification. The classification reached an accuracy close to 70% for the three clusters. We also performed a functional enrichment analysis. The clustering analysis revealed enriched genes in cluster 2, such as CDT1, CENPI, and NLGN1, associated with the molecular epithelial-to-mesenchymal transition (EMT) process. Our approach facilitated the identification of genes that are biologically relevant to LUSC development, including genes significant for predicting patient survival, such as ALDH3B1, C7, FAM83A, FOSB, GCGR, BMP7, PPP1R27, and AQP1, and putative therapeutic targets for LUSC, such as FAM83A, CAV1, TNS4, EIF4G1, TFAP2A, GCGR, and PPP1R27.
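
The abstract does not state which tools or thresholds were used for the differential expression step, so the following is only a minimal sketch of the general idea: comparing mean expression in tumor versus normal samples with a log2 fold change and keeping genes that exceed a cutoff. The gene names, expression values, pseudocount, and cutoff are all hypothetical and chosen for illustration. A real DEA would also apply statistical testing with multiple-testing correction, which is omitted here.

```python
import math
from statistics import mean

# Toy expression values per gene (hypothetical; not taken from the study data).
normal = {"GENE_A": [8.0, 8.0], "GENE_B": [4.0, 4.0], "GENE_C": [5.0, 5.0]}
tumor = {"GENE_A": [32.0, 32.0], "GENE_B": [1.0, 1.0], "GENE_C": [5.0, 6.0]}


def log2_fold_change(tumor_vals, normal_vals, pseudocount=1.0):
    """log2 ratio of mean tumor to mean normal expression.

    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    return math.log2(
        (mean(tumor_vals) + pseudocount) / (mean(normal_vals) + pseudocount)
    )


def select_genes(tumor, normal, min_abs_lfc=1.0):
    """Keep genes whose absolute log2 fold change reaches the cutoff.

    The cutoff of 1.0 (a two-fold change) is illustrative only.
    """
    selected = {}
    for gene in tumor:
        lfc = log2_fold_change(tumor[gene], normal[gene])
        if abs(lfc) >= min_abs_lfc:
            selected[gene] = lfc
    return selected


selected = select_genes(tumor, normal)
for gene, lfc in sorted(selected.items()):
    direction = "up" if lfc > 0 else "down"
    print(f"{gene}: log2FC={lfc:.2f} ({direction} in tumor)")
```

In a pipeline like the one described, a gene list selected this way would then feed the clustering, feature selection, and random forest classification steps.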

Keywords