Autonomous medical evaluation for guideline adherence of large language models

Dennis Fast; Lisa C. Adams; Felix Busch; Conor Fallon; Marc Huppertz; Robert Siepmann; Philipp Prucker; Nadine Bayerl; Daniel Truhn; Marcus Makowski; Alexander Löser; Keno K. Bressem

doi:10.1038/s41746-024-01356-6

npj Digital Medicine (Dec 2024)

Affiliations

Dennis Fast: DATEXIS, Berliner Hochschule für Technik (BHT)
Lisa C. Adams: Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital
Felix Busch: Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital
Conor Fallon: DATEXIS, Berliner Hochschule für Technik (BHT)
Marc Huppertz: Department of Radiology, University Hospital Aachen
Robert Siepmann: Department of Radiology, University Hospital Aachen
Philipp Prucker: Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital
Nadine Bayerl: Department of Radiology, University Hospital Erlangen, Friedrich-Alexander-University (FAU) Erlangen-Nuremberg
Daniel Truhn: Department of Radiology, University Hospital Aachen
Marcus Makowski: Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital
Alexander Löser: DATEXIS, Berliner Hochschule für Technik (BHT)
Keno K. Bressem: Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital

DOI: https://doi.org/10.1038/s41746-024-01356-6
Journal volume & issue: Vol. 7, no. 1
pp. 1 – 14

Abstract

Autonomous Medical Evaluation for Guideline Adherence (AMEGA) is a comprehensive benchmark designed to evaluate large language models’ adherence to medical guidelines across 20 diagnostic scenarios spanning 13 specialties. It includes an evaluation framework and methodology to assess models’ capabilities in medical reasoning, differential diagnosis, treatment planning, and guideline adherence, using open-ended questions that mirror real-world clinical interactions. It includes 135 questions and 1337 weighted scoring elements designed to assess comprehensive medical knowledge. In tests of 17 LLMs, GPT-4 scored highest with 41.9/50, followed closely by Llama-3 70B and WizardLM-2-8x22B. For comparison, a recent medical graduate scored 25.8/50. The benchmark introduces novel content to avoid the issue of LLMs memorizing existing medical data. AMEGA’s publicly available code supports further research in AI-assisted clinical decision-making, aiming to enhance patient care by aiding clinicians in diagnosis and treatment under time constraints.

Published in npj Digital Medicine

ISSN: 2398-6352 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics
Website: https://www.nature.com/npjdigitalmed/
