Discover Artificial Intelligence (May 2025)
Large language models and questions from older adults: a human and machine-based evaluation study
Abstract
Large language models (LLMs) have the potential to offer substantial advantages in information generation and comprehension. This study evaluates how effectively these models meet the needs of older adults by examining responses to 23 questions posed to ChatGPT, Google Gemini, Claude AI, Microsoft Copilot, and Google Search. The responses were evaluated for accuracy, comprehensibility, relevance, and conciseness. Both the LLMs and a panel of seven researchers assessed the answers. ChatGPT received the highest ratings for accuracy and comprehensibility, Google Gemini for conciseness, and both ChatGPT and Claude AI were rated highest for reliability. These ratings were further analysed to compare the LLMs’ ratings with those of the researchers. The LLMs generally awarded high ratings of 4 or 5, whereas the researchers’ ratings were more varied. Microsoft Copilot most closely aligned with the researchers’ evaluations of accuracy and comprehensibility, while Claude AI and ChatGPT showed the closest alignment for conciseness and relevance, respectively. Furthermore, to identify which platform may be best suited to different types of information, the questions were divided into five categories, with ChatGPT emerging as the best-suited LLM in most categories. These findings, along with the rubric and research methods used in this study, can be replicated to assess the performance of LLMs across different research areas and domains.
Keywords