Journal of Medical Internet Research (Aug 2024)

Comparing GPT-4 and Human Researchers in Health Care Data Analysis: Qualitative Description Study

  • Kevin Danis Li,
  • Adrian M Fernandez,
  • Rachel Schwartz,
  • Natalie Rios,
  • Marvin Nathaniel Carlisle,
  • Gregory M Amend,
  • Hiren V Patel,
  • Benjamin N Breyer

DOI: https://doi.org/10.2196/56500
Volume 26, article e56500

Abstract


Background: Large language models (LLMs) such as GPT-4 (OpenAI) have opened new avenues in health care and qualitative research. Traditional qualitative methods are time-consuming and require expertise to capture nuance. Although LLMs have demonstrated stronger contextual understanding and inference than traditional natural language processing, their performance in qualitative analysis relative to that of human researchers remains unexplored.

Objective: We evaluated the effectiveness of GPT-4 versus human researchers in the qualitative analysis of interviews with patients with adult-acquired buried penis (AABP).

Methods: Qualitative data were obtained from semistructured interviews with 20 patients with AABP. Human analysis involved a structured 3-stage process: initial observations, line-by-line coding, and consensus discussions to refine themes. Artificial intelligence (AI) analysis with GPT-4 proceeded in two phases: (1) a naïve phase, in which GPT-4 outputs were independently evaluated by a blinded reviewer to identify themes and subthemes, and (2) a comparison phase, in which AI-generated themes were compared with human-identified themes to assess agreement. We used a general qualitative description approach.

Results: The study population (N=20) comprised predominantly White (17/20, 85%), married (12/20, 60%), heterosexual (19/20, 95%) men, with a mean age of 58.8 years and a mean BMI of 41.1 kg/m². Human qualitative analysis identified “urinary issues” in 95% (19/20) of interviews and GPT-4 in 75% (15/20), with the subtheme “spray or stream” noted in 60% (12/20) and 35% (7/20), respectively. “Sexual issues” were prominent (19/20, 95% humans vs 16/20, 80% GPT-4), although humans identified a wider range of subthemes, including “pain with sex or masturbation” (7/20, 35%) and “difficulty with sex or masturbation” (4/20, 20%). Both analyses highlighted “mental health issues” at the same rate (11/20, 55%), although humans coded “depression” more frequently (10/20, 50% humans vs 4/20, 20% GPT-4). Humans frequently cited “issues using public restrooms” (12/20, 60%) as affecting social life, whereas GPT-4 emphasized “struggles with romantic relationships” (9/20, 45%). “Hygiene issues” were consistently recognized (14/20, 70% humans vs 13/20, 65% GPT-4). Humans uniquely identified “contributing factors” as a theme in all interviews. Agreement between human and GPT-4 coding was moderate (κ=0.401). Reliability assessments of GPT-4’s analyses showed consistent coding for themes including “body image struggles” and “chronic pain” (10/10, 100%) and “depression” (9/10, 90%). Other themes, such as “motivation for surgery” and “weight challenges,” were reliably coded (8/10, 80%), while less frequent themes were identified variably across iterations.

Conclusions: Large language models such as GPT-4 can effectively identify key themes in qualitative health care data, showing moderate agreement with human analysis. Although human analysis yielded a richer diversity of subthemes, the consistency of AI suggests its use as a complementary tool in qualitative research. As AI rapidly advances, future studies should iterate analyses and circumvent token limitations by segmenting data, extending the breadth and depth of LLM-driven qualitative analyses.
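The reported agreement (κ=0.401) is Cohen's kappa, which corrects observed rater agreement for the agreement expected by chance. A minimal Python sketch of the computation follows; the `cohens_kappa` helper and the example presence/absence vectors are illustrative, not the study's actual coding data.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two raters' categorical labels on the same items."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    # Observed agreement: fraction of items both raters labeled identically
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical per-interview theme coding: 1 = theme present, 0 = absent
human = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
gpt4  = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]
print(round(cohens_kappa(human, gpt4), 3))  # → 0.474
```

Values near 0.4 to 0.6 are conventionally read as moderate agreement, consistent with the study's interpretation.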
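The conclusion's suggestion to circumvent token limitations by segmenting data can be sketched as an overlapping sliding-window splitter. This is a hypothetical heuristic, not the study's method: it uses whitespace word counts as a rough proxy for tokens (real tokenizers count differently), and the overlap preserves context across segment boundaries.

```python
def segment_transcript(text, max_tokens=3000, overlap=200):
    """Split a transcript into overlapping word-based segments that each
    fit under an assumed context-window budget."""
    words = text.split()
    segments = []
    start = 0
    while start < len(words):
        end = min(start + max_tokens, len(words))
        segments.append(" ".join(words[start:end]))
        if end == len(words):
            break
        # Step back by `overlap` words so adjacent segments share context
        start = end - overlap
    return segments

# A 7000-word transcript splits into 3 overlapping segments under this budget
transcript = ("word " * 7000).strip()
chunks = segment_transcript(transcript, max_tokens=3000, overlap=200)
print(len(chunks))  # → 3
```

Each segment would then be analyzed separately and the per-segment themes merged, at the cost of the model losing whole-interview context.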