IEEE Access (Jan 2024)
Mixed-Precision Neural Architecture Search and Dynamic Split Point Selection for Split Computing
Abstract
Split computing (SC) is an emerging technique for performing deep neural network (DNN) inference on mobile devices and cloud/edge servers in a hybrid manner. To reduce the end-to-end inference time over the network, SC splits a single DNN model into a head model and a tail model, which are deployed on the mobile device and the server, respectively. An extension of SC, referred to as dynamic SC (DSC), determines the split point dynamically according to network conditions such as bandwidth. This article proposes a DNN optimization approach for DSC based on mixed-precision quantization. Given a vanilla DNN model, our approach optimizes it in two steps. First, a DSC-aware mixed-precision layer-wise quantization is performed statically via neural architecture search to generate multiple potential split points. Then, a bitwidth-wise DSC algorithm is applied to dynamically select one optimal split point among the candidate points. Our evaluation on the EfficientNet-B0 and EfficientNet-B3 architectures demonstrated that our approach provides more effective split points than existing quantization works while mitigating the degradation of inference accuracy. In terms of end-to-end inference time, on the EfficientNet-B0 (B3) architecture, our approach obtained relative average and maximum gains of 9.12% (4.05%) and 27.49% (12.42%), respectively, over a state-of-the-art mixed-precision quantization work while achieving comparable accuracy.
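The dynamic selection step can be illustrated with a minimal sketch. It is not the article's bitwidth-wise DSC algorithm, whose details the abstract does not give; it assumes a simple cost model in which end-to-end time is the head compute time plus the transfer time of the quantized intermediate tensor plus the tail compute time. The names `SplitCandidate`, `end_to_end_time`, and `select_split_point`, and all numeric values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitCandidate:
    """One candidate split point (hypothetical structure, for illustration)."""
    layer_index: int    # layer after which the model is split
    bitwidth: int       # quantization bitwidth of the transmitted tensor
    num_elements: int   # number of elements in the transmitted tensor
    head_time_s: float  # assumed head-model latency on the mobile device
    tail_time_s: float  # assumed tail-model latency on the server

    @property
    def payload_bits(self) -> int:
        # Lower bitwidth means fewer bits to send over the network.
        return self.num_elements * self.bitwidth


def end_to_end_time(c: SplitCandidate, bandwidth_bps: float) -> float:
    """Assumed cost model: head compute + transfer + tail compute."""
    return c.head_time_s + c.payload_bits / bandwidth_bps + c.tail_time_s


def select_split_point(candidates, bandwidth_bps: float) -> SplitCandidate:
    """Pick the candidate with the lowest estimated end-to-end time."""
    if bandwidth_bps <= 0:
        raise ValueError("bandwidth_bps must be positive")
    if not candidates:
        raise ValueError("candidates must not be empty")
    return min(candidates, key=lambda c: end_to_end_time(c, bandwidth_bps))


# Made-up candidates: an early split with a large 8-bit tensor and
# a late split with a small 2-bit tensor.
candidates = [
    SplitCandidate(layer_index=3, bitwidth=8, num_elements=100_000,
                   head_time_s=0.005, tail_time_s=0.020),
    SplitCandidate(layer_index=10, bitwidth=2, num_elements=50_000,
                   head_time_s=0.020, tail_time_s=0.008),
]

low_bw_choice = select_split_point(candidates, bandwidth_bps=1e6)   # 1 Mbit/s
high_bw_choice = select_split_point(candidates, bandwidth_bps=1e9)  # 1 Gbit/s
```

Under these made-up numbers, the late low-bitwidth split wins at 1 Mbit/s because transfer time dominates, while the early split wins at 1 Gbit/s because the server's faster compute dominates. This is the kind of bandwidth-dependent trade-off that motivates generating several candidate split points in advance.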
Keywords