Fine-tuning ChatGPT for automatic scoring

Ehsan Latif; Xiaoming Zhai

Computers and Education: Artificial Intelligence (Jun 2024)

Affiliations

Ehsan Latif: AI4STEM Education Center, University of Georgia, Athens, GA, USA
Xiaoming Zhai (corresponding author): AI4STEM Education Center, University of Georgia, Athens, GA, USA

Journal volume & issue: Vol. 6, p. 100210

Abstract

This study highlights the potential of fine-tuned ChatGPT (GPT-3.5) for automatically scoring students' written constructed responses, using example assessment tasks in science education. The application of ChatGPT in research and academic fields has greatly enhanced productivity and efficiency. Recent studies of ChatGPT, which is based on OpenAI's generative model GPT-3.5, have shown its strength in predicting natural language with high accuracy and in producing human-like responses. GPT-3.5 was trained on enormous amounts of online language material such as journals and Wikipedia; however, using pre-trained GPT-3.5 directly is insufficient for automatic scoring, because students do not use the same language as journals or Wikipedia and accurate scoring requires contextual information. This implies that fine-tuning a domain-specific model on task-specific data can enhance model performance. In this study, we fine-tuned GPT-3.5 on six assessment tasks with a diverse dataset of middle-school and high-school student responses and expert scoring. The six tasks comprise two multi-label and four multi-class assessment tasks. We compare the performance of fine-tuned GPT-3.5 with that of fine-tuned BERT, Google's state-of-the-art language model. The results show that BERT, trained on in-domain corpora constructed from science questions and responses, achieved an average accuracy of 0.838 (SD = 0.069). GPT-3.5 shows a remarkable average increase (9.1%) in automatic scoring accuracy (mean = 0.915, SD = 0.042) across the six tasks, p = 0.001 < 0.05. Specifically, for each of the two multi-label tasks (item 1 with 5 labels; item 2 with 10 labels), GPT-3.5 achieved significantly higher scoring accuracy than BERT across all the labels, with the second item achieving a 7.1% increase. For the four multi-class items, the average increase in scoring accuracy for GPT-3.5 over BERT was 10.6%. Our study confirmed the effectiveness of fine-tuned GPT-3.5 for highly accurate automatic scoring of student responses on domain-specific data in education. We have released the fine-tuned models for public use and community engagement.
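
The abstract reports scoring accuracy for multi-class items and for each label of the multi-label items. As an illustration only, the following minimal Python sketch shows one common way such accuracies could be computed from human and machine scores. The function names and the toy scores are hypothetical and are not taken from the study; the abstract does not specify the authors' exact evaluation procedure.

```python
from statistics import mean


def multiclass_accuracy(human, predicted):
    """Fraction of responses whose predicted class equals the human score."""
    if len(human) != len(predicted):
        raise ValueError("human and predicted must have the same length")
    return mean(1.0 if h == p else 0.0 for h, p in zip(human, predicted))


def per_label_accuracy(human, predicted):
    """Accuracy computed separately for each label of a multi-label item.

    Each response is a tuple of 0/1 flags, one flag per label.
    """
    n_labels = len(human[0])
    return [
        multiclass_accuracy(
            [row[j] for row in human],
            [row[j] for row in predicted],
        )
        for j in range(n_labels)
    ]


# Hypothetical toy scores (not data from the study).
human_classes = [0, 1, 2, 1]
model_classes = [0, 1, 1, 1]
print(multiclass_accuracy(human_classes, model_classes))

human_labels = [(1, 0), (0, 1)]
model_labels = [(1, 1), (0, 1)]
print(per_label_accuracy(human_labels, model_labels))
```

Under this sketch, a multi-class item yields a single accuracy value, while a multi-label item yields one value per label, which matches how the abstract compares the two models "across all the labels".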

Published in Computers and Education: Artificial Intelligence

ISSN: 2666-920X (Online)
Publisher: Elsevier
Country of publisher: United Kingdom
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.journals.elsevier.com/computers-and-education-artificial-intelligence
