Computers and Education: Artificial Intelligence (Jun 2024)

Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability

  • Austin Pack
  • Alex Barrett
  • Juan Escalante

Journal volume & issue
Vol. 6
p. 100234

Abstract

Advancements in generative AI, such as large language models (LLMs), may offer a solution to the burdensome task of essay grading that language teachers often face. Yet the validity and reliability of leveraging LLMs for automated essay scoring (AES) in language education are not well understood. To address this, we evaluated the cross-sectional and longitudinal validity and reliability of four prominent LLMs: Google's PaLM 2, Anthropic's Claude 2, and OpenAI's GPT-3.5 and GPT-4, for the AES of English language learners' writing. A total of 119 essays taken from an English language placement test were assessed by each LLM on two separate occasions, as well as by a pair of human raters. GPT-4 performed the best, demonstrating excellent intrarater reliability and good validity. All models, with the exception of GPT-3.5, improved in intrarater reliability over time. The interrater reliability of GPT-3.5 and GPT-4, however, decreased slightly over time. These findings indicate that some models perform better than others in AES and that all models are subject to fluctuations in performance. We discuss potential reasons for this variability and offer suggestions for prospective avenues of research.

Keywords