Evaluating large language models for selection of statistical test for research: A pilot study

Himel Mondal; Shaikat Mondal; Prabhat Mittal

doi:10.4103/picr.picr_275_23

Perspectives in Clinical Research (Oct 2024)

DOI: https://doi.org/10.4103/picr.picr_275_23
Journal volume & issue: Vol. 15, no. 4, pp. 178–182

Abstract

Background: In contemporary research, selecting the appropriate statistical test is a critical and often challenging step. The emergence of large language models (LLMs) offers a promising avenue for automating this process, potentially enhancing the efficiency and accuracy of statistical test selection.

Aim: This study aimed to assess the capability of four freely available LLMs (OpenAI’s ChatGPT3.5, Google Bard, Microsoft Bing Chat, and Perplexity) to recommend suitable statistical tests for research, comparing their recommendations with those made by human experts.

Materials and Methods: A total of 27 case vignettes were prepared for common research models, each with a question asking for a suitable statistical test. The cases were formulated from previously published literature and reviewed by a human expert for accuracy. The LLMs were asked the question with each case vignette, and the process was repeated with paraphrased cases. Concordance (the recommendation exactly matched the answer key) and acceptance (the recommendation did not exactly match the answer key but could be considered suitable) were evaluated between the LLMs’ recommendations and those of human experts.

Results: Among the 27 case vignettes, the statistical tests suggested by ChatGPT3.5 had 85.19% concordance and 100% acceptance; Bard had 77.78% concordance and 96.3% acceptance; Microsoft Bing Chat had 96.3% concordance and 100% acceptance; and Perplexity had 85.19% concordance and 100% acceptance. The intraclass correlation coefficient (average measures) among the responses of the LLMs was 0.728 (95% confidence interval [CI]: 0.51–0.86), P […]

Conclusion: […] >75% concordance in suggesting statistical tests for research case vignettes, with all having acceptance of >95%. The LLMs had a moderate level of agreement among them. While not a complete replacement for human expertise, these models can serve as effective decision support systems, especially in scenarios where rapid test selection is essential.
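As an illustration of how the reported percentages relate to the 27 vignettes, the following is a minimal sketch in Python, not the authors' analysis code. It assumes each LLM response is given one of three ratings and that acceptance counts concordant responses as well as merely suitable ones; the reported figures (for example, 100% acceptance alongside 85.19% concordance) are consistent with that reading. The `Rating` labels, the `summarize` helper, and the per-model counts are hypothetical: the counts are back-calculated from the percentages in the abstract and are not stated in the source.

```python
from enum import Enum


class Rating(Enum):
    """Hypothetical rating labels for one LLM response to one vignette."""
    CONCORDANT = "concordant"      # exactly matches the answer key
    ACCEPTABLE = "acceptable"      # not an exact match, but considered suitable
    UNACCEPTABLE = "unacceptable"  # neither of the above


def summarize(ratings):
    """Return (concordance %, acceptance %) rounded to two decimals.

    Assumption: acceptance is cumulative, i.e. it includes concordant
    responses as well as acceptable ones.
    """
    n = len(ratings)
    concordant = sum(r is Rating.CONCORDANT for r in ratings)
    accepted = sum(r is not Rating.UNACCEPTABLE for r in ratings)
    return round(100 * concordant / n, 2), round(100 * accepted / n, 2)


# Counts back-calculated from the reported percentages (an assumption):
# 23/27 concordant with all 27 accepted gives 85.19% and 100%.
chatgpt = [Rating.CONCORDANT] * 23 + [Rating.ACCEPTABLE] * 4
# 21/27 concordant with 26/27 accepted gives 77.78% and 96.3%.
bard = ([Rating.CONCORDANT] * 21 + [Rating.ACCEPTABLE] * 5
        + [Rating.UNACCEPTABLE] * 1)

print(summarize(chatgpt))
print(summarize(bard))
```

Under these assumed counts, the helper reproduces the percentages reported for ChatGPT3.5 and Bard; the same arithmetic with 26 concordant responses out of 27 matches the 96.3% concordance reported for Microsoft Bing Chat.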

Published in Perspectives in Clinical Research

ISSN: 2229-3485 (Print); 2229-5488 (Online)
Publisher: Wolters Kluwer Medknow Publications
Country of publisher: India
LCC subjects: Medicine: Medicine (General)
Website: https://journals.lww.com/PICP/pages/default.aspx
