Annals of Surgery Open (Dec 2023)
The Accuracy of the NSQIP Universal Surgical Risk Calculator Compared to Operation-Specific Calculators
Abstract
Objective: To compare the performance of the ACS NSQIP “universal” risk calculator (N-RC) to operation-specific RCs. Background: Resources have been directed toward building operation-specific RCs because of an implicit belief that they would provide more accurate risk estimates than the N-RC. However, operation-specific calculators may not provide sufficient improvements in accuracy to justify the costs in development, maintenance, and access. Methods: For the N-RC, a cohort of 5,020,713 NSQIP patient records was randomly divided into 80% for machine learning algorithm training and 20% for validation. Operation-specific risk calculators (OS-RC) and OS-RCs with operation-specific predictors (OSP-RC) were independently developed for each of 6 operative groups (colectomy, Whipple pancreatectomy, thyroidectomy, open abdominal aortic aneurysm repair, hysterectomy/myomectomy, and total knee arthroplasty) and 14 outcomes, using the same 80%/20% split applied to the appropriate subsets of the 5,020,713 records. Predictive accuracy was evaluated using the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), and Hosmer-Lemeshow (H-L) P values for the 13 binary outcomes, and using mean squared error for the length of stay outcome. Results: The N-RC was found to have greater AUROC (P = 0.002) and greater AUPRC (P < 0.001) compared to the OS-RC. No other statistically significant differences in accuracy across the 3 risk calculator types were found. There was an inverse relationship between the operation group sample size and the magnitude of the difference between N-RC and OS-RC in AUROC (r = −0.278; P = 0.014) and in AUPRC (r = −0.425; P < 0.001). The smaller the sample size, the greater the superiority of the N-RC. Conclusions: While operation-specific RCs might be assumed to have advantages over a universal RC, their reliance on smaller datasets may reduce their ability to accurately estimate predictor effects.
In the present study, this tradeoff between operation specificity and accuracy in estimating the effects of predictor variables favors the N-RC, though the clinical impact is likely to be negligible.
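As an illustration of the evaluation described in the Methods, the following is a minimal Python sketch of a random 80%/20% split and of the three discrimination and error measures named in the abstract (AUROC, AUPRC, and mean squared error). It is not the authors' implementation. The function names are hypothetical, AUPRC is estimated here as average precision (one common estimator; the abstract does not state which was used), tied scores are not given special treatment in the AUPRC, and the pairwise AUROC loop is written for clarity rather than for a dataset of millions of records.

```python
import random


def split_80_20(records, seed=0):
    """Randomly divide records into 80% training and 20% validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def auroc(y_true, y_score):
    """AUROC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auprc(y_true, y_score):
    """AUPRC estimated as average precision: the mean of the precision
    observed at the rank of each positive case, ranking by descending
    score. Assumption: this estimator stands in for whatever the study used."""
    n_pos = sum(1 for y in y_true if y == 1)
    if n_pos == 0:
        raise ValueError("AUPRC needs at least one positive")
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def mean_squared_error(y_true, y_pred):
    """Mean squared error, as used for a continuous outcome such as length of stay."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the study.
    outcomes = [0, 0, 1, 1]
    predicted_risk = [0.1, 0.4, 0.35, 0.8]
    print(auroc(outcomes, predicted_risk))
    print(auprc(outcomes, predicted_risk))
```

Because AUPRC depends on the event rate while AUROC does not, the two measures can rank calculators differently for rare outcomes, which is one reason to report both.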