SAMPLE-BASED DYNAMIC HIERARCHICAL TRANSFORMER WITH LAYER AND HEAD FLEXIBILITY VIA CONTEXTUAL BANDIT

Fanfei Meng; Lele Zhang; Yu Chen; Yuxin Wang

doi:10.24874/PES06.02.001

Proceedings on Engineering Sciences (Jun 2024)

Affiliations

Fanfei Meng: Department of Electrical and Computer Engineering, Northwestern University, Evanston, 60208, IL, United States
Lele Zhang: Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Yu Chen: Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Yuxin Wang: Department of Electrical and Computer Engineering, Northwestern University, Evanston, 60208, IL, United States

DOI: https://doi.org/10.24874/PES06.02.001
Journal volume & issue: Vol. 6, no. 2
pp. 439 – 452

Abstract

Transformers require a fixed number of layers and heads, which makes them inflexible to the complexity of individual samples and expensive in training and inference. To address this, we propose a sample-based Dynamic Hierarchical Transformer (DHT) model whose layers and heads can be dynamically configured for individual data samples by solving contextual bandit problems. To determine the number of layers and heads, we use the Upper Confidence Bound (UCB) algorithm, and we deploy combinatorial Thompson Sampling to select specific head combinations given their number. Unlike previous work that focuses on compressing trained networks for inference only, DHT not only adaptively optimizes the underlying network architecture during training but also provides a flexible network for efficient inference. To the best of our knowledge, this is the first comprehensive data-driven dynamic transformer that implements the dynamic system without any additional auxiliary neural networks. According to the experimental results, we achieve up to 74% computational savings for both training and inference with a minimal loss of accuracy.
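
The two-stage selection described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it drops the per-sample context and uses plain UCB1 (an upper confidence bound rule) to pick how many heads to activate, followed by Beta-Bernoulli Thompson Sampling with top-k selection to pick which heads. The candidate head counts, the reward function, the success probabilities, and all function names below are hypothetical and chosen only for illustration.

```python
import math
import random


def ucb1_select(counts, means, t, c=2.0):
    """Return the arm with the highest UCB1 score; untried arms are pulled first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def update_mean(counts, means, arm, reward):
    """Incrementally update the running mean reward of one arm."""
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]


def thompson_top_k(alpha, beta, k, rng):
    """Draw one Beta sample per head and keep the k heads with the largest draws."""
    draws = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    ranked = sorted(range(len(draws)), key=draws.__getitem__, reverse=True)
    return sorted(ranked[:k])


rng = random.Random(0)
head_options = [1, 2, 4, 8]           # hypothetical candidate head counts
counts = [0] * len(head_options)      # pulls per candidate
means = [0.0] * len(head_options)     # running mean reward per candidate
alpha = [1.0] * 8                     # Beta(1, 1) prior per head (8 heads assumed)
beta = [1.0] * 8

for t in range(1, 201):
    # Stage 1: UCB1 chooses how many heads to use.
    arm = ucb1_select(counts, means, t)
    k = head_options[arm]
    # Stage 2: Thompson Sampling chooses which k heads to use.
    heads = thompson_top_k(alpha, beta, k, rng)
    # Toy reward (assumption): heads 0-3 are useful, each active head costs compute.
    useful = sum(1 for h in heads if h < 4)
    reward = useful / 4 - 0.05 * k
    update_mean(counts, means, arm, reward)
    # Toy Bernoulli feedback per selected head (assumption).
    for h in heads:
        if rng.random() < (0.8 if h < 4 else 0.2):
            alpha[h] += 1
        else:
            beta[h] += 1

print("pulls per head-count option:", dict(zip(head_options, counts)))
```

In this toy setting the reward trades usefulness of the chosen heads against a per-head cost, which mirrors the accuracy versus computation trade-off the abstract refers to. A contextual variant would condition both choices on features of the input sample.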

Published in Proceedings on Engineering Sciences

ISSN: 2620-2832 (Print); 2683-4111 (Online)
Publisher: University of Kragujevac
Country of publisher: Serbia
LCC subjects: Technology: Engineering (General). Civil engineering (General)
Website: http://pesjournal.net
