IEEE Access (Jan 2021)

Linear Model Selection and Regularization for Serum Prostate-Specific Antigen Prediction of Patients With Prostate Cancer Using R

  • Gongli Li,
  • Han Li

DOI
https://doi.org/10.1109/ACCESS.2021.3095914
Journal volume & issue
Vol. 9
pp. 97591 – 97602

Abstract

Prostate cancer is a commonly diagnosed cancer worldwide, with 1,276 thousand new cases and 359 thousand deaths in 2018. Prostate-specific antigen (PSA) blood level is often elevated in men with prostate cancer, so PSA testing can detect prostate tumours when they are small, low-grade, and localized. PSA testing is hard to provide in less developed and poor areas that lack sufficient medical funds, so early and accurate prediction of PSA level by statistical machine learning models is important for avoiding later stages of prostate cancer, in which the disease spreads outside the prostate. In this paper, we compare three linear model selection and regularization methods (shrinkage, subset selection, dimension reduction) and nine candidate models (OLS regression, ridge regression, lasso regression, elastic net, best subset selection, forward subset selection, backward subset selection, principal component regression (PCR), and partial least squares (PLS)) based on leave-one-out cross-validation (LOOCV) prediction error. Because the selection criterion, LOOCV, is sensitive to outliers, Mahalanobis distance is used to detect and delete outliers before each model is run. The shrinkage method (lasso and elastic net models only) and the subset selection method (based on adjusted $R^{2}$, BIC, Cp, and cross-validation prediction error) can perform variable selection. The feature selection results show that prostate weight, cancer volume, amount of benign prostatic hyperplasia, and seminal vesicle invasion are necessary variables that must be included when predicting PSA. Age and capsular penetration are the least important variables. Gleason score and the percentage of Gleason scores 4 or 5 are sometimes essential. All diagnostic figures and results are coded in R, open access, and published on IEEE Xplore Code Ocean.
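
The abstract's remark that LOOCV prediction error is sensitive to outliers can be illustrated with a minimal sketch. This is not the authors' code (theirs is written in R and uses the prostate data). It is a standard-library Python example on made-up toy numbers, and the helper names `fit_simple_ols` and `loocv_mse` are hypothetical. It fits a one-predictor least-squares line, leaves each observation out in turn, and averages the squared prediction errors.

```python
from statistics import mean


def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def loocv_mse(xs, ys):
    """Leave-one-out cross-validation mean squared prediction error."""
    squared_errors = []
    for i in range(len(xs)):
        # Refit the model on every observation except the i-th one.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_simple_ols(train_x, train_y)
        # Score the model on the single held-out observation.
        prediction = intercept + slope * xs[i]
        squared_errors.append((ys[i] - prediction) ** 2)
    return mean(squared_errors)


# Toy data (an assumption for illustration only): an exact line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_clean = [2.0 * x + 1.0 for x in xs]

# The same data with the last response replaced by an outlying value.
ys_outlier = ys_clean[:-1] + [40.0]

print("LOOCV MSE, clean data:  ", loocv_mse(xs, ys_clean))
print("LOOCV MSE, with outlier:", loocv_mse(xs, ys_outlier))
```

A single outlying response inflates the LOOCV error sharply, because the outlier is itself held out and predicted once and also distorts every fit it takes part in. This is consistent with the abstract's choice to screen outliers before comparing models.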

Keywords