Engineering (Aug 2022)

Prediction of Driver Gene Matching in Lung Cancer NOG/PDX Models Based on Artificial Intelligence

  • Yayi He,
  • Haoyue Guo,
  • Li Diao,
  • Yu Chen,
  • Junjie Zhu,
  • Hiran C. Fernando,
  • Diego Gonzalez Rivas,
  • Hui Qi,
  • Chunlei Dai,
  • Xuzhen Tang,
  • Jun Zhu,
  • Jiawei Dai,
  • Kan He,
  • Dan Chan,
  • Yang Yang

Journal volume & issue
Vol. 15
pp. 102 – 114

Abstract

Patient-derived tumor xenografts (PDXs) are a powerful tool for drug discovery and screening in cancer. However, genotype mismatches in PDXs remain poorly understood and lead to massive economic losses. Here, we established PDX models from 53 lung cancer patients with a genotype matching rate of 79.2% (42/53). Furthermore, 17 clinicopathological features were examined and used as inputs to stepwise logistic regression (LR) models selected by the lowest Akaike information criterion (AIC), least absolute shrinkage and selection operator (LASSO)-LR, support vector machine (SVM) recursive feature elimination (SVM-RFE), extreme gradient boosting (XGBoost), and gradient boosting and categorical features (CatBoost), together with the synthetic minority oversampling technique (SMOTE). Finally, the performance of all models was evaluated by accuracy, area under the receiver operating characteristic curve (AUC), and F1 score in 100 testing groups. Two multivariable LR models revealed that age, number of driver gene mutations, epidermal growth factor receptor (EGFR) gene mutations, type of prior chemotherapy, prior tyrosine kinase inhibitor (TKI) therapy, and the source of the sample were powerful predictors. Moreover, CatBoost (mean accuracy = 0.960; mean AUC = 0.939; mean F1 score = 0.908) and the eight-feature SVM-RFE (mean accuracy = 0.950; mean AUC = 0.934; mean F1 score = 0.903) showed the best performance among the algorithms. Meanwhile, application of the SMOTE improved the predictive capability of most models, except CatBoost. With the SMOTE applied, the ensemble classifier of single models achieved the highest accuracy (mean = 0.975), AUC (mean = 0.949), and F1 score (mean = 0.938). In conclusion, we established an optimal predictive model to screen lung cancer patients for non-obese diabetic (NOD)/Shi-scid, interleukin-2 receptor (IL-2R) γnull (NOG)/PDX models and offer a general approach for building predictive models.
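The evaluation described in the abstract (mean accuracy, AUC, and F1 score across repeated testing groups) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the 0.5 decision threshold, the label coding (1 = genotype-matched), and the two toy testing groups are all assumptions made for illustration, and the scores are invented. Only the 42/53 matching rate comes from the abstract.

```python
from statistics import mean

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores, positive=1):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def summarize(groups, threshold=0.5):
    """Average the three metrics over a list of (y_true, scores) testing groups."""
    accs, aucs, f1s = [], [], []
    for y_true, scores in groups:
        y_pred = [1 if s >= threshold else 0 for s in scores]
        accs.append(accuracy(y_true, y_pred))
        aucs.append(auc(y_true, scores))
        f1s.append(f1_score(y_true, y_pred))
    return {"accuracy": mean(accs), "auc": mean(aucs), "f1": mean(f1s)}

# Genotype matching rate reported in the abstract: 42 of 53 models.
matching_rate_pct = round(42 / 53 * 100, 1)

# Two invented testing groups (the study used 100); 1 = matched, 0 = mismatched.
toy_groups = [
    ([1, 1, 1, 0], [0.9, 0.8, 0.4, 0.3]),
    ([1, 0, 1, 0], [0.7, 0.6, 0.2, 0.1]),
]
result = summarize(toy_groups)
```

Averaging over many testing groups, rather than reporting a single split, reduces the influence of any one partition, which matters for a cohort of only 53 patients.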

Keywords