JMIR Formative Research (Mar 2023)

Calibrating a Transformer-Based Model’s Confidence on Community-Engaged Research Studies: Decision Support Evaluation Study

  • Brian Ferrell,
  • Sarah E Raskin,
  • Emily B Zimmerman

DOI
https://doi.org/10.2196/41516
Journal volume & issue
Vol. 7
p. e41516

Abstract

Read online

BackgroundDeep learning offers great benefits in classification tasks such as medical imaging diagnostics or stock trading, especially when compared with human-level performances, and can be a viable option for classifying distinct levels within community-engaged research (CEnR). CEnR is a collaborative approach between academics and community partners with the aim of conducting research that is relevant to community needs while incorporating diverse forms of expertise. In the field of deep learning and artificial intelligence (AI), training multiple models to obtain the highest validation accuracy is common practice; however, it can overfit toward that specific data set and not generalize well to a real-world population, which creates issues of bias and potentially dangerous algorithmic decisions. Consequently, if we plan on automating human decision-making, there is a need for creating techniques and exhaustive evaluative processes for these powerful unexplainable models to ensure that we do not incorporate and blindly trust poor AI models to make real-world decisions. ObjectiveWe aimed to conduct an evaluation study to see whether our most accurate transformer-based models derived from previous studies could emulate our own classification spectrum for tracking CEnR studies as well as whether the use of calibrated confidence scores was meaningful. MethodsWe compared the results from 3 domain experts, who classified a sample of 45 studies derived from our university’s institutional review board database, with those from 3 previously trained transformer-based models, as well as investigated whether calibrated confidence scores can be a viable technique for using AI in a support role for complex decision-making systems. ResultsOur findings reveal that certain models exhibit an overestimation of their performance through high confidence scores, despite not achieving the highest validation accuracy. ConclusionsFuture studies should be conducted with larger sample sizes to generalize the results more effectively. Although our study addresses the concerns of bias and overfitting in deep learning models, there is a need to further explore methods that allow domain experts to trust our models more. The use of a calibrated confidence score can be a misleading metric when determining our AI model’s level of competency.