Computers and Education: Artificial Intelligence (Jun 2025)

Evaluating the capability of large language models in characterising relational feedback: A comparative analysis of prompting strategies

  • Wei Dai
  • Yixin Cheng
  • Ahmad Ari Aldino
  • Yi-Shan Tsai
  • Dragan Gašević
  • Guanliang Chen

DOI: https://doi.org/10.1016/j.caeai.2025.100427
Journal volume & issue: Vol. 8, p. 100427

Abstract

Relational feedback is increasingly recognised for its crucial role in strengthening student-instructor relationships and promoting students' uptake of feedback. Despite its significance, no studies have attempted to develop automated methods for analysing written feedback for relational properties, which could promote relational feedback at scale and assist feedback providers in their relational feedback practices. Such automated analysis can be framed as a text classification task. However, traditional machine-learning and deep-learning methods for text classification typically require extensive human labelling, posing a significant challenge for educators and researchers who lack machine learning and data science expertise. Large language models offer a promising alternative owing to their strong performance on text classification tasks and their capacity to be instructed through natural-language prompts. Prompting strategies can substantially influence model performance; however, it remains unclear how prompts should be designed to enable the accurate characterisation of relational feedback. Therefore, this study aims to investigate the capability of GPT-4o, OpenAI's flagship general-purpose model, in characterising relational feedback and to evaluate how its effectiveness varies across zero-shot, one-shot and few-shot prompting strategies. Results from extensive experiments on a real-world dataset of 793 feedback sentences revealed that: i) GPT-4o achieved an average accuracy exceeding 0.8 for nine out of ten relational characteristics, and an average F1 score exceeding 0.7 for six out of ten; ii) GPT-4o's classification performance showed no significant differences across prompting strategies for eight out of ten relational characteristics; and iii) GPT-4o's classification performance improved when related relational characteristics were explicitly distinguished within the prompt. Our findings underscore the potential of large language models for identifying relational feedback and indicate that providing a clear definition of each relational characteristic improves classification performance more effectively than incorporating exemplars in the prompt.
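
To make the compared prompting strategies concrete, the sketch below illustrates how a single feedback sentence could be classified for one relational characteristic with GPT-4o under zero-shot versus few-shot prompting. It is a minimal, hypothetical example, not the authors' actual prompts or definitions: the characteristic "encouraging", its definition, and the exemplars are invented for illustration (the abstract does not name the ten characteristics), and it assumes the official `openai` Python client with an OPENAI_API_KEY set in the environment.

```python
# Minimal sketch of zero-shot vs few-shot binary classification with GPT-4o.
# The characteristic, definition, and exemplars below are HYPOTHETICAL, not
# taken from the paper. Assumes: `pip install openai` and OPENAI_API_KEY set.
from openai import OpenAI

client = OpenAI()

# Hypothetical definition of one relational characteristic. The paper's
# finding suggests a clear definition like this matters more than exemplars.
DEFINITION = (
    "A feedback sentence is 'encouraging' if it expresses confidence in the "
    "student's ability to improve. Answer only 'yes' or 'no'."
)

# Hypothetical labelled exemplars used only under few-shot prompting.
FEW_SHOT_EXEMPLARS = [
    ("You clearly have the skills to take this further.", "yes"),
    ("The second paragraph repeats the first.", "no"),
]

def classify(sentence: str, few_shot: bool = False) -> str:
    """Label one feedback sentence 'yes'/'no', zero-shot or few-shot."""
    messages = [{"role": "system", "content": DEFINITION}]
    if few_shot:
        # Few-shot: prepend exemplars as prior user/assistant turns.
        for example, label in FEW_SHOT_EXEMPLARS:
            messages.append({"role": "user", "content": example})
            messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": sentence})
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic labels for reproducible evaluation
        messages=messages,
    )
    return response.choices[0].message.content.strip().lower()

print(classify("Keep going; your argument is getting stronger.", few_shot=True))
```

Setting the temperature to 0 keeps the labels deterministic, which is the usual choice when comparing prompting strategies against human-coded ground truth, as the study does with accuracy and F1.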

Keywords