Sensors (Apr 2021)
Memory-Replay Knowledge Distillation
Abstract
Knowledge Distillation (KD), which transfers the knowledge from a teacher to a student network by penalizing their Kullback–Leibler (KL) divergence, is a widely used tool for Deep Neural Network (DNN) compression in intelligent sensor systems. Traditional KD uses a pre-trained teacher, while self-KD distills the network's own knowledge to achieve better performance. The role of the teacher in self-KD is usually played by multi-branch peers or by identical samples under different augmentations. However, the self-KD methods mentioned above have limitations that hinder widespread use: the former requires redesigning the DNN for each task, and the latter relies on the effectiveness of the augmentation method. To avoid these limitations, we propose a new self-KD method, Memory-replay Knowledge Distillation (MrKD), that uses historical models as teachers. Firstly, we propose a novel self-KD training method that penalizes the KD loss between the current model's output distributions and those of its backup checkpoints along the training trajectory. This strategy regularizes the model with its historical output distribution space to stabilize learning. Secondly, a simple Fully Connected Network (FCN) is applied to ensemble the historical teachers' outputs for better guidance. Finally, to ensure that the teacher outputs present the ground-truth class as the top prediction, we correct the teacher logit outputs with the Knowledge Adjustment (KA) method. Experiments on image (CIFAR-100, CIFAR-10, and CINIC-10) and audio (DCASE) classification tasks show that MrKD improves single-model training and works efficiently across different datasets. In contrast to existing self-KD methods that rely on various forms of external knowledge, the effectiveness of MrKD sheds light on the historical models that are usually discarded along the training trajectory.
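To make the core idea concrete, the following is a minimal sketch of a memory-replay KD loss as described above: cross-entropy on the ground truth plus a KL term toward a historical checkpoint of the same model. It assumes PyTorch; the function names and the `alpha` and `temperature` hyperparameters are illustrative assumptions, and the FCN ensemble and Knowledge Adjustment steps of the full method are omitted.

```python
import torch
import torch.nn.functional as F

def mrkd_loss(student_logits, teacher_logits, targets, alpha=0.5, temperature=4.0):
    """Cross-entropy on ground truth plus KL divergence to a historical teacher.
    `alpha` and `temperature` are assumed hyperparameters, not values from the paper."""
    ce = F.cross_entropy(student_logits, targets)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return (1.0 - alpha) * ce + alpha * kd

def train_step(model, teacher_snapshot, x, y, optimizer):
    """One training step where the teacher is a frozen snapshot of the model
    saved earlier on the training trajectory (a single-checkpoint simplification)."""
    with torch.no_grad():
        teacher_logits = teacher_snapshot(x)  # historical model's output
    student_logits = model(x)
    loss = mrkd_loss(student_logits, teacher_logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the snapshot would be refreshed periodically (e.g., with `copy.deepcopy(model)` every few epochs) so that the teacher always lags the current model on the training trajectory; the schedule shown here is an assumption rather than the paper's exact procedure.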
Keywords