Applied Sciences (Mar 2024)

Optimizing Large Language Models on Multi-Core CPUs: A Case Study of the BERT Model

  • Lanxin Zhao
  • Wanrong Gao
  • Jianbin Fang

DOI: https://doi.org/10.3390/app14062364
Journal volume & issue: Vol. 14, No. 6, p. 2364

Abstract

The BERT model is regarded as the cornerstone of the pre-trained large language models that have achieved promising results in recent years. This article investigates how to optimize the BERT model in terms of fine-tuning speed and prediction accuracy, aiming to accelerate its execution on a multi-core processor and to improve its prediction accuracy on typical downstream natural language processing tasks. Our contributions are two-fold. First, we port the BERT model onto a multi-core shared-memory processor and parallelize its fine-tuning training, accelerating the fine-tuning process for downstream tasks. Second, we improve prediction performance on typical downstream natural language processing tasks by tuning the model's hyperparameters. We select five typical downstream tasks (CoLA, SST-2, MRPC, RTE, and WNLI) and optimize them on the multi-core platform, taking the batch size, learning rate, and number of training epochs into account. Our experimental results show that increasing the number of CPUs and the number of threads significantly reduces the model training time, with the savings concentrated primarily in the self-attention mechanism. Further experiments show that setting reasonable hyperparameters improves the accuracy of the BERT model on downstream tasks, and that appropriately increasing the batch size, given sufficient computing resources, can significantly reduce training time.
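
The abstract does not name the software stack or the exact settings used; the sketch below is only an illustration of the workflow it describes, assuming PyTorch and the Hugging Face Transformers/Datasets libraries, with MRPC as the example GLUE task. The thread count and hyperparameter values shown are placeholders, not the paper's tuned values.

# Illustrative sketch (assumed stack: PyTorch + Hugging Face Transformers/Datasets).
# Fine-tunes bert-base-uncased on one downstream GLUE task (MRPC) on a multi-core CPU,
# pinning the intra-op thread count and exposing the hyperparameters the paper studies:
# batch size, learning rate, and number of training epochs.
import os
os.environ["OMP_NUM_THREADS"] = "16"   # example thread count, set before torch starts OpenMP

import torch
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

torch.set_num_threads(16)              # intra-op parallelism on the multi-core CPU

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

raw = load_dataset("glue", "mrpc")     # sentence-pair paraphrase task

def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length", max_length=128)

encoded = raw.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-mrpc-cpu",
    per_device_train_batch_size=32,    # batch size: one of the tuned hyperparameters
    learning_rate=2e-5,                # learning rate: tuned per task
    num_train_epochs=3,                # training epochs: tuned per task
    no_cuda=True,                      # force CPU execution
)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()
print(trainer.evaluate())

In this setting, varying OMP_NUM_THREADS / torch.set_num_threads corresponds to the thread-scaling experiments, while per_device_train_batch_size, learning_rate, and num_train_epochs correspond to the hyperparameters tuned per downstream task.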

Keywords