African Journal of Health Professions Education (May 2023)

An evaluation of the inter-rater reliability in a clinical skills objective structured clinical examination

  • V de Beer,
  • J Nel,
  • FP Pieterse,
  • A Snyman,
  • G Joubert,
  • MJ Labuschagne

DOI
https://doi.org/10.7196/AJHPE.2023.v15i2.1574

Abstract

Read online

Background. An objective structured clinical examination (OSCE) is a performance-based examination used to assess health sciences students and is awell-recognised tool to assess clinical skills with or without using real patients.Objectives. To determine the inter-rater reliability of experienced and novice assessors from different clinical backgrounds on the final mark allocationsduring assessment of third-year medical students’ final OSCE at the University of the Free State.Methods. This cross-sectional analytical study included 24 assessors and 145 students. After training and written instructions, two assessors per station(urology history taking, respiratory examination and gynaecology skills assessment) each independently assessed the same student for the same skill bycompleting their individual checklists. At each station, assessors could also give a global rating mark (from 1 to 5) as an overall impression.Results. The urology history-taking station had the lowest mean score (53.4%) and the gynaecology skills station the highest (71.1%). Seven (58.3%) ofthe 12 assessor pairs differed by >5% regarding the final mark, with differences ranging from 5.2% to 12.2%. For two pairs the entire confidence interval(CI) was within the 5% range, whereas for five pairs the entire CI was outside the 5% range. Only one pair achieved substantial agreement (weightedkappa statistic 0.74 ‒ urology history taking). There was no consistency within or across stations regarding whether the experienced or novice assessorgave higher marks. For the respiratory examination and gynaecology skills stations, all pairs differed for the majority of students regarding the globalrating mark. Weighted kappa statistics indicated that no pair achieved substantial agreement regarding this mark.Conclusion. Despite previous experience, written instructions and training in the use of the checklists, differences between assessors were found inmost cases.