AdaDS: Adaptive data selection for accelerating pre-trained language model knowledge distillation

Qinhong Zhou; Peng Li; Yang Liu; Yuyang Guan; Qizhou Xing; Ming Chen; Maosong Sun; Yang Liu

AI Open (Jan 2023)

Affiliations

Qinhong Zhou: Department of Computer Science and Technology, Tsinghua University, China
Peng Li: Institute for AI Industry Research, Tsinghua University, China; Corresponding author.
Yang Liu: Institute for AI Industry Research, Tsinghua University, China
Yuyang Guan: Beijing Sinovoice Technology Co., Ltd., China
Qizhou Xing: Beijing Sinovoice Technology Co., Ltd., China
Ming Chen: Beijing Sinovoice Technology Co., Ltd., China
Maosong Sun: Department of Computer Science and Technology, Tsinghua University, China
Yang Liu: Department of Computer Science and Technology, Tsinghua University, China; Institute for AI Industry Research, Tsinghua University, China

Journal volume & issue: Vol. 4, pp. 56–63

Abstract

Knowledge distillation (KD) is a widely used method for transferring knowledge from large teacher models to computationally efficient student models. Unfortunately, the computational cost of KD becomes unaffordable as pre-trained language models (PLMs) grow larger. Computing the KD loss on only part of the training set is a promising way to accelerate KD. However, existing works heuristically apply a single static data selection strategy throughout the KD process and show inconsistent improvements across distillation scenarios. In this work, we conduct a thorough study of typical data selection strategies for KD and show that this inconsistency arises because the best data selection strategy depends on several factors, including the task, the selected data size, and the training stage. To adapt to these factors automatically, we propose a framework named AdaDS that learns to choose the data selection strategy adaptively during the KD process. Experimental results show that the proposed method is effective for various tasks and selected data sizes in both the fine-tuning and pre-training stages, achieving performance comparable to DistilBERT with only 10% of the queries to the teacher model.
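The idea of computing the KD loss on only part of the training set can be illustrated with a minimal Python sketch. It is not the AdaDS algorithm, which the abstract does not specify. The strategy names ("random", "high_entropy", "low_entropy"), the entropy-based scoring, the temperature value, and the helper names `select`, `subset_kd_loss`, and `teacher_fn` are all assumptions made for illustration. The sketch only shows that the teacher is queried for the selected examples and no others.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def select(student_logits_batch, k, strategy, rng):
    """Return sorted indices of k examples picked by a hypothetical strategy."""
    n = len(student_logits_batch)
    if strategy == "random":
        return sorted(rng.sample(range(n), k))
    if strategy not in ("high_entropy", "low_entropy"):
        raise ValueError(f"unknown strategy: {strategy}")
    # Score each example by the entropy of the student's own prediction,
    # so selection itself needs no teacher query (an assumption of this sketch).
    scores = [entropy(softmax(logits)) for logits in student_logits_batch]
    order = sorted(range(n), key=lambda i: scores[i],
                   reverse=(strategy == "high_entropy"))
    return sorted(order[:k])

def subset_kd_loss(student_logits_batch, teacher_fn, k, strategy, rng):
    """Query the teacher only for selected examples and average their KD loss."""
    chosen = select(student_logits_batch, k, strategy, rng)
    losses = [kd_loss(student_logits_batch[i], teacher_fn(i)) for i in chosen]
    return sum(losses) / len(losses), chosen
```

Here the `strategy` argument is fixed for each call. An adaptive scheme in the spirit of the abstract would vary it across tasks, selected data sizes, and training stages. How AdaDS learns that choice is not described in this text.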

Published in AI Open

ISSN: 2666-6510 (Online)
Publisher: KeAi Communications Co. Ltd.
Country of publisher: China
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://www.keaipublishing.com/en/journals/ai-open/
