Heliyon (Oct 2024)
Challenging large language models’ “intelligence” with human tools: A neuropsychological investigation in Italian language on prefrontal functioning
Abstract
The Artificial Intelligence (AI) research community has used ad-hoc benchmarks to measure the “intelligence” level of Large Language Models (LLMs). In humans, intelligence is closely linked to the functional integrity of the prefrontal lobes, which are essential for higher-order cognitive processes. Previous research has found that LLMs struggle with cognitive tasks that rely on these prefrontal functions, highlighting a significant challenge in replicating human-like intelligence. In December 2022, OpenAI released ChatGPT, a new chatbot based on the GPT-3.5 model that quickly gained popularity for its impressive ability to understand and respond to human instructions, suggesting a significant step towards intelligent behaviour in AI. Therefore, to rigorously investigate LLMs’ level of “intelligence,” we evaluated the GPT-3.5 and GPT-4 versions through a neuropsychological assessment using tests in the Italian language routinely employed to assess prefrontal functioning in humans. The same tests were also administered to Claude2 and Llama2 to verify whether similar language models perform similarly on prefrontal tests. When using human performance as a reference, GPT-3.5 showed inhomogeneous results on prefrontal tests, with some scores well above average, others in the lower range, and others frankly impaired. Specifically, we identified poor planning abilities and difficulty in recognising semantic absurdities and in understanding others’ intentions and mental states. Claude2 exhibited a pattern similar to GPT-3.5, while Llama2 performed poorly on almost all tests. These inconsistent profiles highlight how LLMs’ emergent abilities do not yet mimic human cognitive functioning. The sole exception was GPT-4, which performed within the normative range on all tasks except planning. Furthermore, we showed how standardised neuropsychological batteries developed to assess human cognitive functions may be suitable for challenging LLMs’ performance.