Evaluating the performance of ChatGPT and GPT-4o in coding classroom discourse data: A study of synchronous online mathematics instruction

Simin Xu; Xiaowei Huang; Chung Kwan Lo; Gaowei Chen; Morris Siu-yung Jong

Computers and Education: Artificial Intelligence (Dec 2024)

Affiliations

Simin Xu: Department of Mathematics and Information Technology, The Education University of Hong Kong, Hong Kong SAR
Xiaowei Huang: Department of Mathematics and Information Technology, The Education University of Hong Kong, Hong Kong SAR
Chung Kwan Lo (corresponding author): Department of Mathematics and Information Technology, The Education University of Hong Kong, Hong Kong SAR
Gaowei Chen: Faculty of Education, The University of Hong Kong, Hong Kong SAR
Morris Siu-yung Jong: Department of Curriculum and Instruction & Centre for Learning Sciences and Technologies, The Chinese University of Hong Kong, Hong Kong SAR

Journal volume & issue: Vol. 7, p. 100325

Abstract

High-quality instruction is essential to facilitating student learning, prompting many professional development (PD) programmes for teachers to focus on improving classroom dialogue. However, during PD programmes, analysing discourse data is time-consuming, delaying feedback on teachers' performance and potentially impairing the programmes' effectiveness. We therefore explored the use of ChatGPT (a fine-tuned GPT-3.5 series model) and GPT-4o to automate the coding of classroom discourse data. We equipped these AI tools with a codebook designed for mathematics discourse and academically productive talk. Our dataset consisted of over 400 authentic talk turns in Chinese from synchronous online mathematics lessons. The coding outcomes of ChatGPT and GPT-4o were quantitatively compared against a human standard. Qualitative analysis was conducted to understand their coding decisions. The overall agreement between the human standard, ChatGPT output, and GPT-4o output was moderate (Fleiss's Kappa = 0.46) when classifying talk turns into major categories. Pairwise comparisons indicated that GPT-4o (Cohen's Kappa = 0.69) had better performance than ChatGPT (Cohen's Kappa = 0.33). However, at the code level, the performance of both AI tools was unsatisfactory. Based on the identified competences and weaknesses, we propose a two-stage approach to classroom discourse analysis. Specifically, GPT-4o can be employed for the initial category-level analysis, following which teacher educators can conduct a more detailed code-level analysis and refine the coding outcomes. This approach can facilitate timely provision of analytical resources for teachers to reflect on their teaching practices.
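
For readers unfamiliar with the pairwise agreement statistic reported above, the following is a minimal sketch of how Cohen's Kappa can be computed for two coders who label the same talk turns. It is not the authors' analysis code. The function name `cohen_kappa`, the category labels, and the eight example turns are all hypothetical and chosen only for illustration; the study's actual codebook categories are not given in this abstract.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's Kappa for two equal-length sequences of nominal labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")

    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal proportions, summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; Kappa is
        # undefined here, so this sketch treats it as perfect agreement.
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical category-level codes for eight talk turns (illustrative only).
human = ["math", "math", "talk", "talk", "other", "math", "talk", "other"]
model = ["math", "talk", "talk", "talk", "other", "math", "other", "other"]

print(f"Cohen's Kappa: {cohen_kappa(human, model):.2f}")
```

Kappa corrects raw percentage agreement for the agreement expected by chance, which is why it is preferred over simple accuracy when some categories are much more frequent than others. Fleiss's Kappa, also reported above, extends the same idea to more than two raters.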

Published in Computers and Education: Artificial Intelligence

ISSN: 2666-920X (Online)
Publisher: Elsevier
Country of publisher: United Kingdom
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.journals.elsevier.com/computers-and-education-artificial-intelligence
