Scientific Reports (Dec 2024)

The RepVig framework for designing use-case specific representative vignettes and evaluating triage accuracy of laypeople and symptom assessment applications

  • Marvin Kopka,
  • Hendrik Napierala,
  • Martin Privoznik,
  • Desislava Sapunova,
  • Sizhuo Zhang,
  • Markus A. Feufel

DOI
https://doi.org/10.1038/s41598-024-83844-z
Journal volume & issue
Vol. 14, no. 1
pp. 1 – 13

Abstract


Most studies evaluating symptom-assessment applications (SAAs) rely on a common set of case vignettes that are authored by clinicians and devoid of context, which may be representative of clinical settings but not of situations where patients use SAAs. Assuming the use case of self-triage, we used representative design principles to sample case vignettes from online platforms where patients describe their symptoms to obtain professional advice and compared triage performance of laypeople, SAAs (e.g., WebMD or NHS 111), and Large Language Models (LLMs, e.g., GPT-4 or Claude) on representative versus standard vignettes. We found performance differences in all three groups depending on vignette type: When using representative vignettes, accuracy was higher (OR = 1.52 to 2.00, p < .001 to .03 in binary decisions, i.e., correct or incorrect), safety was higher (OR = 1.81 to 3.41, p < .001 to .002 in binary decisions, i.e., safe or unsafe), and the inclination to overtriage was also higher (OR = 1.80 to 2.66, p < .001 to p = .035 in binary decisions, overtriage or undertriage error). Additionally, we found changed rankings of best-performing SAAs and LLMs. Based on these results, we argue that our representative vignette sampling approach (which we call the RepVig Framework) should replace the practice of using a fixed vignette set as the standard for SAA evaluation studies.