Complex & Intelligent Systems (Mar 2024)

Knowledge distillation based on projector integration and classifier sharing

  • Guanpeng Zuo,
  • Chenlu Zhang,
  • Zhe Zheng,
  • Wu Zhang,
  • Ruiqing Wang,
  • Jingqi Lu,
  • Xiu Jin,
  • Zhaohui Jiang,
  • Yuan Rao

DOI
https://doi.org/10.1007/s40747-024-01394-3
Journal volume & issue
Vol. 10, no. 3
pp. 4521–4533

Abstract

Knowledge distillation transfers knowledge from a pre-trained teacher model to a student model, thereby accomplishing effective model compression. Previous studies have carefully crafted knowledge representations, loss function designs, and distillation location selections, but few have examined the role of classifiers in distillation. Prior experience shows that a model's final classifier plays an essential role in inference, so this paper attempts to narrow the performance gap between models by having the student model directly use the teacher model's classifier for final inference, which requires an additional projector to match the features of the student encoder to the teacher's classifier. However, a single projector cannot fully align the features, and integrating multiple projectors may yield better performance. Balancing projector size against performance, we experimentally determine suitable projector sizes for different network combinations and propose a simple method for projector integration. In this way, the student model projects its features and then uses the teacher model's classifier for inference, achieving performance similar to the teacher's. Extensive experiments on the CIFAR-100 and Tiny-ImageNet datasets show that our approach applies simply and effectively to various teacher–student frameworks.
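As a rough illustration of the idea described in the abstract, the following PyTorch sketch (not the authors' implementation) wires a student encoder to a frozen teacher classifier through an ensemble of projectors whose outputs are averaged. The module names, the linear-plus-BatchNorm projector design, the number of projectors, and the averaging scheme are all assumptions made here for illustration.

```python
# Minimal sketch (assumed design, not the paper's code): student features are
# mapped by several projectors into the teacher's feature space, integrated by
# averaging, and classified with the teacher's reused (frozen) classifier head.
import torch
import torch.nn as nn

class ProjectorEnsembleStudent(nn.Module):
    def __init__(self, student_encoder, teacher_classifier,
                 student_dim, teacher_dim, num_projectors=3):
        super().__init__()
        self.encoder = student_encoder          # trainable student backbone
        self.classifier = teacher_classifier    # shared teacher head
        for p in self.classifier.parameters():  # keep the teacher head frozen
            p.requires_grad = False
        # Independent projectors from student_dim to teacher_dim
        # (linear + BatchNorm is an assumption for this sketch).
        self.projectors = nn.ModuleList(
            nn.Sequential(nn.Linear(student_dim, teacher_dim),
                          nn.BatchNorm1d(teacher_dim))
            for _ in range(num_projectors)
        )

    def forward(self, x):
        f = self.encoder(x)  # student features, shape (batch, student_dim)
        # Integrate the projectors by averaging their projected features.
        f = torch.stack([proj(f) for proj in self.projectors]).mean(dim=0)
        return self.classifier(f)  # logits from the teacher's classifier
```

During distillation, only the student encoder and the projectors would be trained (e.g. with a cross-entropy or distillation loss on the logits) while the shared teacher classifier stays fixed; averaging is just one possible way to integrate the projectors.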

Keywords