Telematics and Informatics Reports (Mar 2024)
A cross-sectional study to assess responses generated by ChatGPT and ChatSonic to patient queries about epilepsy
Abstract
Objective: This study compares the responses of two AI chatbots, ChatGPT and ChatSonic, to inquiries about epilepsy. The two chatbots are broadly similar in their capabilities and limitations and are among the most widely used AI chatbots, but they differ in key respects such as their training data, supported languages, and pricing models. The study aims to assess the potential application of AI in patient counseling and decision-making regarding epilepsy treatment.

Methods: Patient inquiries about epilepsy were categorized into two groups: patient counseling and judgment. Ten questions were formulated within these categories. Two specialist physicians evaluated the reliability and accuracy of the chatbot responses using the Global Quality Scale (GQS) and a modified version of the DISCERN score (reliability score, RS).

Results: The median GQS was 4.5 for Evaluator JC and 4.0 for Evaluator VV, and the median RS was 5.0 for Evaluator JC and 4.0 for Evaluator VV. The evaluators' GQS scores had a Spearman correlation coefficient of -0.531 (p = 0.016), a statistically significant inverse association. Their RS scores had a correlation coefficient of 0.368 (p = 0.110), a positive but statistically non-significant association that does not establish a relationship between the variables. Weighted kappa was used to assess inter-rater agreement. For GQS, the weighted kappa was -0.318 (95% CI: -0.570 to -0.065); this rejects the null hypothesis, indicating a statistically significant negative agreement between Evaluator JC and Evaluator VV. For RS, the weighted kappa was 0.1327 (95% CI: -0.093 to 0.359), which fails to reject the null hypothesis, indicating no significant agreement between the evaluators. These results suggest that both ChatGPT and ChatSonic have the potential to be valuable tools for epilepsy patients and their healthcare providers. However, the evaluators' scores showed a statistically significant relationship for the GQS but not for the RS, suggesting that the GQS may be a more reliable measure of the quality of chatbot responses.

Conclusion: The findings underscore the importance of collaboration among policymakers, healthcare professionals, and AI designers to ensure the appropriate and safe use of AI chatbots in healthcare. While AI chatbots can provide valuable information, it is crucial to acknowledge their limitations, including their reliance on training data and occasional factual errors. The study concludes by highlighting the need for further testing and validation of AI language models in the management of epilepsy.
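To make the reported statistics concrete, the following is a minimal sketch of how a Spearman correlation (with p-value) and a weighted kappa (with a bootstrap 95% CI) could be computed for two raters' ordinal scores in Python. The score arrays are hypothetical placeholders rather than the study's data, and the linear weighting scheme and bootstrap CI method are assumptions, since the abstract does not state which variants were used.

    # Sketch of the inter-rater analysis; the scores are illustrative
    # placeholders, NOT the study's data. Linear kappa weighting and a
    # bootstrap CI are assumptions not specified in the abstract.
    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical 1-5 ratings from two evaluators for ten chatbot answers
    jc_gqs = np.array([5, 4, 5, 4, 5, 3, 5, 4, 5, 4])
    vv_gqs = np.array([3, 4, 3, 5, 3, 5, 4, 4, 3, 5])

    # Spearman rank correlation and its p-value
    rho, p_value = spearmanr(jc_gqs, vv_gqs)
    print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")

    # Weighted kappa measuring agreement between the two raters
    kappa = cohen_kappa_score(jc_gqs, vv_gqs, weights="linear")

    # 95% CI for kappa via a nonparametric bootstrap over the ten answers
    rng = np.random.default_rng(seed=0)
    n = len(jc_gqs)
    boot = [
        cohen_kappa_score(jc_gqs[idx], vv_gqs[idx], weights="linear")
        for idx in (rng.integers(0, n, n) for _ in range(5000))
    ]
    lo, hi = np.nanpercentile(boot, [2.5, 97.5])  # nan-safe for degenerate resamples
    print(f"Weighted kappa = {kappa:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")

Note that agreement statistics such as kappa are reported alongside correlation because two raters can be strongly correlated while still differing systematically in the scores they assign.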