Autocorrelation Matrix Knowledge Distillation: A Task-Specific Distillation Method for BERT Models

Kai Zhang; Jinqiu Li; Bingqian Wang; Haoran Meng

doi:10.3390/app14209180

Applied Sciences (Oct 2024)

Affiliations

Kai Zhang: Department of Electronic Information Engineering, School of Physics and Electronic Information, Henan Polytechnic University, Wenyuan Street, Jiaozuo 454099, China
Jinqiu Li: Department of Electronic Information Engineering, School of Physics and Electronic Information, Henan Polytechnic University, Wenyuan Street, Jiaozuo 454099, China
Bingqian Wang: Department of Electronic Information Engineering, School of Physics and Electronic Information, Henan Polytechnic University, Wenyuan Street, Jiaozuo 454099, China
Haoran Meng: Department of Electronic Information Engineering, School of Physics and Electronic Information, Henan Polytechnic University, Wenyuan Street, Jiaozuo 454099, China

DOI: https://doi.org/10.3390/app14209180
Journal volume & issue: Vol. 14, no. 20, p. 9180

Abstract

Pre-trained language models perform well on a wide range of natural language processing tasks. However, their large parameter counts make them hard to run on resource-constrained edge devices, which limits their practical deployment. This paper introduces Autocorrelation Matrix Knowledge Distillation (AMKD), a simple and efficient method for improving the task-specific performance of smaller BERT models so that they are better suited to such deployment. AMKD uses the autocorrelation matrix to capture the relationships between features, so the student model learns not only how individual features behave in the teacher model but also the correlations among those features. It also addresses the dimensional mismatch between the hidden states of the student and teacher models: even when the student's dimensions are smaller, AMKD retains the essential features of the teacher model and thereby minimizes information loss. Experimental results show that BERT_TINY-AMKD outperforms traditional distillation methods and baseline models, achieving an average score of 83.6% on GLUE tasks. This is a 4.1% improvement over BERT_TINY-KD and exceeds the performance of BERT_4-PKD and DistilBERT_4 by 2.6% and 3.9%, respectively. Moreover, with only 13.3% of the parameters of BERT_BASE, BERT_TINY-AMKD retains over 96.3% of the performance of its teacher model, BERT_BASE.
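The abstract does not give the AMKD formula, so the following is only a minimal sketch of one way an autocorrelation-style matrix can sidestep the dimension mismatch it mentions. It assumes the matrix is the product of the L2-normalized hidden states with their own transpose, which has shape (seq_len, seq_len) regardless of hidden width, and that student and teacher matrices are compared with a mean squared error. The function names, the normalization, the loss choice, and the student width of 312 are all illustrative assumptions, not details taken from the paper; 768 is the hidden size of BERT_BASE.

```python
import numpy as np


def autocorrelation_matrix(hidden):
    """Assumed form: cosine similarity between every pair of token vectors.

    hidden has shape (seq_len, dim); the result has shape (seq_len, seq_len),
    so it does not depend on the hidden dimension.
    """
    norms = np.linalg.norm(hidden, axis=-1, keepdims=True)
    unit = hidden / np.maximum(norms, 1e-12)  # guard against zero vectors
    return unit @ unit.T


def amkd_style_loss(student_hidden, teacher_hidden):
    """Hypothetical loss: MSE between student and teacher matrices."""
    a_student = autocorrelation_matrix(student_hidden)
    a_teacher = autocorrelation_matrix(teacher_hidden)
    return float(np.mean((a_student - a_teacher) ** 2))


# Random stand-ins for one sequence of 8 tokens (not real model outputs).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 768))  # BERT_BASE hidden size
student = rng.normal(size=(8, 312))  # assumed smaller student width
loss = amkd_style_loss(student, teacher)
```

Because both matrices are 8 x 8 here, they can be compared directly without a learned projection layer between the 312-dimensional and 768-dimensional spaces, which is the property the abstract appears to describe.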

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
