Mitigating Cognitive Biases in Clinical Decision-Making Through Multi-Agent Conversations Using Large Language Models: Simulation Study

Yuhe Ke; Rui Yang; Sui An Lie; Taylor Xin Yi Lim; Yilin Ning; Irene Li; Hairil Rizal Abdullah; Daniel Shu Wei Ting; Nan Liu

doi:10.2196/59439

Journal of Medical Internet Research (Nov 2024)

DOI: https://doi.org/10.2196/59439
Journal volume & issue: Vol. 26
p. e59439

Abstract

Background: Cognitive biases in clinical decision-making significantly contribute to errors in diagnosis and suboptimal patient outcomes. Addressing these biases presents a formidable challenge in the medical field.

Objective: This study aimed to explore the role of large language models (LLMs) in mitigating these biases through the use of the multi-agent framework. We simulate the clinical decision-making processes through multi-agent conversation and evaluate its efficacy in improving diagnostic accuracy compared with humans.

Methods: A total of 16 published and unpublished case reports where cognitive biases have resulted in misdiagnoses were identified from the literature. In the multi-agent framework, we leveraged GPT-4 (OpenAI) to facilitate interactions among different simulated agents to replicate clinical team dynamics. Each agent was assigned a distinct role: (1) making the final diagnosis after considering the discussions, (2) acting as a devil’s advocate to correct confirmation and anchoring biases, (3) serving as a field expert in the required medical subspecialty, (4) facilitating discussions to mitigate premature closure bias, and (5) recording and summarizing findings. We tested varying combinations of these agents within the framework to determine which configuration yielded the highest rate of correct final diagnoses. Each scenario was repeated 5 times for consistency. The accuracy of the initial diagnoses and the final differential diagnoses were evaluated, and comparisons with human-generated answers were made using the Fisher exact test.

Results: A total of 240 responses were evaluated (3 different multi-agent frameworks). The initial diagnosis had an accuracy of 0% (0/80). However, following multi-agent discussions, the accuracy for the top 2 differential diagnoses increased to 76% (61/80) for the best-performing multi-agent framework (Framework 4-C). This was significantly higher compared with the accuracy achieved by human evaluators (odds ratio 3.49; P=.002).

Conclusions: The multi-agent framework demonstrated an ability to re-evaluate and correct misconceptions, even in scenarios with misleading initial investigations. In addition, the LLM-driven, multi-agent conversation framework shows promise in enhancing diagnostic accuracy in diagnostically challenging medical scenarios.
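The counts reported in the abstract can be checked with a few lines of arithmetic, and the named test can be sketched in plain Python. The block below is a minimal sketch, not the authors' analysis code: the constant names and the `fisher_exact_two_sided` helper are assumptions introduced here for illustration. The abstract does not give the human evaluators' counts, so the reported odds ratio and P value are not reproduced.

```python
from math import comb

# Figures stated in the abstract.
CASES = 16        # case reports
REPEATS = 5       # repetitions per scenario
FRAMEWORKS = 3    # multi-agent frameworks compared

responses_per_framework = CASES * REPEATS                   # 80
total_responses = responses_per_framework * FRAMEWORKS      # 240

best_correct = 61  # correct top-2 differentials, Framework 4-C
best_accuracy = best_correct / responses_per_framework      # 0.7625, reported as 76%


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]].

    Hypothetical helper: sums the hypergeometric probabilities of every
    table with the same margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
```

To apply the helper to the study's comparison, the framework row would be `(61, 19)` and the second row would hold the human evaluators' correct and incorrect counts from the full paper.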

Published in Journal of Medical Internet Research

ISSN: 1438-8871 (Online)
Publisher: JMIR Publications
Country of publisher: Canada
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Medicine: Public aspects of medicine
Website: https://www.jmir.org
