npj Digital Medicine (Mar 2025)

Red teaming ChatGPT in medicine to yield real-world insights on model behavior

  • Crystal T. Chang,
  • Hodan Farah,
  • Haiwen Gui,
  • Shawheen Justin Rezaei,
  • Charbel Bou-Khalil,
  • Ye-Jean Park,
  • Akshay Swaminathan,
  • Jesutofunmi A. Omiye,
  • Akaash Kolluri,
  • Akash Chaurasia,
  • Alejandro Lozano,
  • Alice Heiman,
  • Allison Sihan Jia,
  • Amit Kaushal,
  • Angela Jia,
  • Angelica Iacovelli,
  • Archer Yang,
  • Arghavan Salles,
  • Arpita Singhal,
  • Balasubramanian Narasimhan,
  • Benjamin Belai,
  • Benjamin H. Jacobson,
  • Binglan Li,
  • Celeste H. Poe,
  • Chandan Sanghera,
  • Chenming Zheng,
  • Conor Messer,
  • Damien Varid Kettud,
  • Deven Pandya,
  • Dhamanpreet Kaur,
  • Diana Hla,
  • Diba Dindoust,
  • Dominik Moehrle,
  • Duncan Ross,
  • Ellaine Chou,
  • Eric Lin,
  • Fateme Nateghi Haredasht,
  • Ge Cheng,
  • Irena Gao,
  • Jacob Chang,
  • Jake Silberg,
  • Jason A. Fries,
  • Jiapeng Xu,
  • Joe Jamison,
  • John S. Tamaresis,
  • Jonathan H. Chen,
  • Joshua Lazaro,
  • Juan M. Banda,
  • Julie J. Lee,
  • Karen Ebert Matthys,
  • Kirsten R. Steffner,
  • Lu Tian,
  • Luca Pegolotti,
  • Malathi Srinivasan,
  • Maniragav Manimaran,
  • Matthew Schwede,
  • Minghe Zhang,
  • Minh Nguyen,
  • Mohsen Fathzadeh,
  • Qian Zhao,
  • Rika Bajra,
  • Rohit Khurana,
  • Ruhana Azam,
  • Rush Bartlett,
  • Sang T. Truong,
  • Scott L. Fleming,
  • Shriti Raj,
  • Solveig Behr,
  • Sonia Onyeka,
  • Sri Muppidi,
  • Tarek Bandali,
  • Tiffany Y. Eulalio,
  • Wenyuan Chen,
  • Xuanyu Zhou,
  • Yanan Ding,
  • Ying Cui,
  • Yuqi Tan,
  • Yutong Liu,
  • Nigam Shah,
  • Roxana Daneshjou

DOI
https://doi.org/10.1038/s41746-025-01542-0
Journal volume & issue
Vol. 8, no. 1
pp. 1–10

Abstract


Red teaming, the practice of adversarially exposing unexpected or undesired model behaviors, is critical to improving the equity and accuracy of large language models, but red teaming conducted by groups unaffiliated with model creators remains scant in healthcare. We convened teams of clinicians, medical and engineering students, and technical professionals (80 participants in total) to stress-test models with real-world clinical cases and to categorize inappropriate responses along the axes of safety, privacy, hallucinations/accuracy, and bias. Six medically trained reviewers re-analyzed the prompt-response pairs and added qualitative annotations. Of 376 unique prompts (1504 responses), 20.1% of responses were inappropriate (GPT-3.5: 25.8%; GPT-4.0: 16%; GPT-4.0 with Internet: 17.8%). We subsequently demonstrate the utility of our benchmark by testing GPT-4o, a model released after our event (20.4% inappropriate). Further, 21.5% of responses that were appropriate with GPT-3.5 were inappropriate in updated models. We share insights for constructing red teaming prompts and present our benchmark for iterative model assessments.
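For readers who want to apply the released benchmark to a new model, the sketch below shows how per-model inappropriate-response rates like those above could be recomputed from annotated prompt-response pairs. This is a minimal illustration, not the authors' pipeline: the file name, column names, and label values are assumptions.

```python
# Minimal sketch (assumed data format, not the paper's code): aggregate
# reviewer annotations into a per-model inappropriate-response rate.
import csv
from collections import defaultdict


def inappropriate_rates(path: str) -> dict[str, float]:
    """Return the fraction of responses labeled 'inappropriate' per model.

    Assumes a CSV with one row per prompt-response pair and columns
    'model' (e.g. 'GPT-3.5', 'GPT-4.0') and 'label'
    ('appropriate' or 'inappropriate'); both are hypothetical names.
    """
    counts = defaultdict(lambda: [0, 0])  # model -> [inappropriate, total]
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[row["model"]][0] += row["label"] == "inappropriate"
            counts[row["model"]][1] += 1
    return {model: bad / total for model, (bad, total) in counts.items()}


if __name__ == "__main__":
    for model, rate in sorted(inappropriate_rates("annotations.csv").items()):
        print(f"{model}: {rate:.1%} inappropriate")
```

The same aggregation, rerun on responses from a newer model such as GPT-4o, is what allows the benchmark to serve as an iterative assessment, including checks for regressions where a prompt answered appropriately by an older model is answered inappropriately by an updated one.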