Unmasking large language models by means of OpenAI GPT-4 and Google AI: A deep instruction-based analysis

Idrees A. Zahid; Shahad Sabbar Joudar; A.S. Albahri; O.S. Albahri; A.H. Alamoodi; Jose Santamaría; Laith Alzubaidi

Intelligent Systems with Applications (Sep 2024)

Affiliations

Idrees A. Zahid: University of Technology, Baghdad, Iraq
Shahad Sabbar Joudar: University of Technology, Baghdad, Iraq
A.S. Albahri: Technical College, Imam Ja'afar Al-Sadiq University, Baghdad, Iraq
O.S. Albahri: Australian Technical and Management College, Melbourne, Australia; Computer Techniques Engineering Department, Mazaya University College, Nasiriyah, Iraq
A.H. Alamoodi: Applied Science Research Center, Applied Science Private University, Amman, Jordan; MEU Research Unit, Middle East University, Amman, Jordan
Jose Santamaría: Department of Computer Science, University of Jaén, 23071, Jaén, Spain
Laith Alzubaidi: School of Mechanical, Medical, and Process Engineering, Queensland University of Technology, Brisbane, 4000, QLD, Australia; Centre for Data Science, Queensland University of Technology, Brisbane, QLD, 4000, Australia; Corresponding author.

Journal volume & issue: Vol. 23
p. 200431

Abstract

Large Language Models (LLMs) have become a hot topic in AI due to their ability to mimic human conversation. This study compares OpenAI's Generative Pre-trained Transformer 4 (GPT-4) model, based on the GPT architecture, with Google AI, which is based on the Bidirectional Encoder Representations from Transformers (BERT) framework, in terms of defined capabilities and built-in architecture. Both LLMs are prominent in AI applications. First, eight capabilities were identified to evaluate these models: translation accuracy, text generation, factuality, creativity, intellect, deception avoidance, sentiment classification, and sarcasm detection. Next, each capability was assessed using instruction-based prompts. Additionally, a categorized LLM evaluation system was developed from a prompt engineering perspective, using ten research questions per category, based on this paper's main contributions. GPT-4 and Google AI successfully answered 85% and 68.7% of the study prompts, respectively. GPT-4 understood prompts better than Google AI, even when they contained verbal flaws, and it tolerated grammatical errors. Moreover, GPT-4 was more precise, accurate, and succinct than Google AI, which was sometimes verbose and less realistic. While GPT-4 outperforms Google AI in translation accuracy, text generation, factuality, intellect, creativity, and deception avoidance, Google AI outperforms GPT-4 in sarcasm detection. Both models performed sentiment classification properly. More importantly, a human panel of judges was used to assess and evaluate the model comparisons. Statistical analysis of the judges' ratings yielded more robust results by examining the specific uses, limitations, and expectations of GPT-4 and Google AI. Finally, the transformer architectures, parameter sizes, and attention mechanisms of the two models were examined.
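The categorized evaluation described above reports the share of prompts each model answered successfully, both overall and per capability. The sketch below shows one way such a tally could be computed. It assumes each prompt is judged simply as passed or failed; the `PromptResult` record, the `success_rates` function, and the model labels are hypothetical illustrations, not the authors' tooling, and the usage data are toy values rather than results from the study.

```python
from collections import defaultdict
from dataclasses import dataclass

# The eight capability categories named in the abstract.
CAPABILITIES = {
    "translation accuracy", "text generation", "factuality", "creativity",
    "intellect", "deception avoidance", "sentiment classification", "sarcasm detection",
}


@dataclass
class PromptResult:
    """Hypothetical record: one judged answer to one prompt."""
    capability: str  # must be one of CAPABILITIES
    model: str       # free-form model label, e.g. "GPT-4"
    passed: bool     # True if the answer was judged successful


def success_rates(results):
    """Return {model: (overall %, {capability: %})} from pass/fail judgements."""
    # model -> capability -> [passed count, total count]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for r in results:
        if r.capability not in CAPABILITIES:
            raise ValueError(f"unknown capability: {r.capability!r}")
        entry = counts[r.model][r.capability]
        entry[0] += int(r.passed)
        entry[1] += 1

    report = {}
    for model, caps in counts.items():
        per_cap = {cap: 100.0 * p / n for cap, (p, n) in caps.items()}
        passed = sum(p for p, _ in caps.values())
        total = sum(n for _, n in caps.values())
        report[model] = (100.0 * passed / total, per_cap)
    return report


if __name__ == "__main__":
    # Toy data only: 10 prompts in one category, the first 8 judged successful.
    toy = [PromptResult("sarcasm detection", "model-A", i < 8) for i in range(10)]
    overall, per_cap = success_rates(toy)["model-A"]
    print(f"{overall:.1f}%", per_cap)
```

Because the overall rate is computed from pooled counts rather than by averaging per-category percentages, categories with more prompts weigh more. With ten questions in every category, as the abstract describes, the two approaches give the same number.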

Published in Intelligent Systems with Applications

ISSN: 2667-3053 (Online)
Publisher: Elsevier
Country of publisher: United Kingdom
LCC subjects: Science: Science (General): Cybernetics; Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.journals.elsevier.com/intelligent-systems-with-applications