Computers and Education: Artificial Intelligence (Dec 2024)
Evaluating the performance of ChatGPT and GPT-4o in coding classroom discourse data: A study of synchronous online mathematics instruction
Abstract
High-quality instruction is essential for student learning, prompting many professional development (PD) programmes for teachers to focus on improving classroom dialogue. However, analysing discourse data during PD programmes is time-consuming, delaying feedback on teachers' performance and potentially impairing the programmes' effectiveness. We therefore explored the use of ChatGPT (a fine-tuned GPT-3.5 series model) and GPT-4o to automate the coding of classroom discourse data. We equipped these AI tools with a codebook designed for mathematics discourse and academically productive talk. Our dataset consisted of over 400 authentic talk turns in Chinese from synchronous online mathematics lessons. The coding outcomes of ChatGPT and GPT-4o were compared quantitatively against a human standard, and qualitative analysis was conducted to understand the tools' coding decisions. When classifying talk turns into major categories, overall agreement among the human standard, the ChatGPT output, and the GPT-4o output was moderate (Fleiss' kappa = 0.46). Pairwise comparisons indicated that GPT-4o (Cohen's kappa = 0.69) outperformed ChatGPT (Cohen's kappa = 0.33). At the finer code level, however, neither tool performed satisfactorily. Based on the strengths and weaknesses identified, we propose a two-stage approach to classroom discourse analysis: GPT-4o performs an initial category-level analysis, after which teacher educators conduct a more detailed code-level analysis and refine the coding outcomes. This approach can give teachers timely analytical resources for reflecting on their teaching practices.
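The agreement statistics reported above follow the standard chance-corrected form kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's label marginals. A minimal sketch of the two-rater (Cohen's) case is below; the category labels and the toy ratings are hypothetical illustrations, not data from the study.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from the raters' label marginals.
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[lab] / n) * (counts_b[lab] / n) for lab in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical major-category labels for six talk turns,
# as coded by a human rater and by a model.
human = ["A", "A", "B", "B", "A", "B"]
model = ["A", "A", "B", "A", "A", "B"]
print(cohen_kappa(human, model))  # → 0.666..., i.e. exactly 2/3
```

Fleiss' kappa generalises the same idea to more than two raters by averaging per-item agreement; in practice, library implementations (e.g. `sklearn.metrics.cohen_kappa_score` or `statsmodels.stats.inter_rater.fleiss_kappa`) would typically be used rather than hand-rolled code.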