IEEE Access (Jan 2024)
Benchmarking Inference of Transformer-Based Transcription Models With Clustering on Embedded GPUs
Abstract
Early awareness of inference performance helps ensure the feasibility of machine learning for embedded deployment. ML model selection often focuses first on training performance and accuracy, with inference considered second. While prioritizing training is necessary, training-time optimizations can incur real performance losses at inference time, so knowing whether an optimized model will run meaningfully faster or more energy efficiently than its unoptimized counterpart is essential in resource-constrained embedded environments. This paper benchmarks one such training optimization, clustered attention, and examines its effect on the inference performance of the transformer-based transcription model wav2vec2. The execution time and energy consumption of this model are evaluated on NVIDIA Jetson embedded GPU devices. Clustered attention targets the transformer self-attention mechanism, whose memory use and execution time scale poorly with input size; these scaling characteristics make self-attention a potentially critical bottleneck that must be observed under realistic conditions. Our research considers three model variants: the reference (original) model, a clustered-attention model, and an improved-clustered-attention model. The reference model was faster for small input sizes, but the clustered model was faster for inputs longer than 10 seconds, and the clustered model had the lowest maximum energy per inference for inputs longer than about 12 seconds of audio. With optimal configuration, the improved-clustered model takes 26.34% more time to execute than the reference model. Given these operational differences, we show that inference performance and energy consumption at deployment should not be overlooked when selecting model optimizations.
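For context, the scaling behavior that motivates clustered attention can be sketched as follows. The asymptotic costs below are standard results for scaled dot-product attention and for the clustered attention of Vyas et al. (2020), not figures reported in this paper; C denotes the number of query clusters, a parameter the abstract does not specify.

\[
\mathrm{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,
\qquad
\underbrace{\mathcal{O}(N^{2}d)}_{\text{full self-attention}}
\;\longrightarrow\;
\underbrace{\mathcal{O}(NCd)}_{\text{clustered},\ C \ll N}
\]

Here N is the input sequence length and d the per-head dimension; the N-by-N score matrix is what drives the quadratic memory and execution-time growth with longer audio inputs, while attending once per cluster centroid instead of once per query reduces this to linear in N for fixed C.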
Keywords