IEEE Access (Jan 2024)
Enhancing Parameter Efficiency in Model Inference Using an Ultralight Inter-Transformer Linear Structure
Abstract
Pre-trained language models are the cornerstone of modern natural language processing and information retrieval. However, fine-tuning all of their parameters reduces the efficiency of these models in both training and inference owing to their increasingly heavy structures. Existing parameter-efficient methods still require approximately 1 MB of storage and approximately $10^{7}$ operations during model deployment and inference. This strains the storage and processor capacity of end devices such as smartphones and IoT equipment, and the resulting slow model inference adversely affects the user experience. To achieve more efficient and storage-friendly inference than mainstream methods such as low-rank adaptation (LoRA) and Adapter, this paper proposes LayerConnect (hyper-network-assisted interlayer connectors). Extensive experiments were conducted to validate the performance of LayerConnect on two essential tasks with completely different learning frameworks and purposes: natural language understanding (using the general language understanding evaluation (GLUE) benchmark) and information retrieval (using the contextualized inverted list (COIL) framework). On both tasks, LayerConnect saves up to 95.31% and 91.18% of the parameters used by LoRA and Adapter, respectively, while limiting performance degradation on GLUE and COIL to less than 8% and 3% relative to LoRA, and to less than 5% and 3% relative to Adapter. In addition, LayerConnect requires approximately 100 kB of storage per task-specific trained model and reduces the number of operations during model inference by four orders of magnitude, to approximately $10^{3}$.
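For intuition only, the sketch below illustrates what a hyper-network-assisted interlayer linear connector could look like in PyTorch. The class names, bottleneck size, residual placement, and the choice to condition the hypernetwork on a per-layer embedding are illustrative assumptions made here, not the architecture specified in the paper.

```python
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    """Tiny hypernetwork that emits the weight and bias of one small linear
    transform from a per-layer embedding (illustrative assumption)."""
    def __init__(self, layer_emb_dim: int, bottleneck_dim: int):
        super().__init__()
        self.bottleneck_dim = bottleneck_dim
        self.to_params = nn.Linear(
            layer_emb_dim, bottleneck_dim * bottleneck_dim + bottleneck_dim)

    def forward(self, layer_emb: torch.Tensor):
        flat = self.to_params(layer_emb)
        w = flat[: self.bottleneck_dim ** 2].view(self.bottleneck_dim, self.bottleneck_dim)
        b = flat[self.bottleneck_dim ** 2:]
        return w, b

class InterLayerConnector(nn.Module):
    """Ultralight linear module placed between two frozen transformer layers:
    project the hidden state down, apply the hypernetwork-generated weights,
    project back up, and add the result residually."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim, bias=False)
        self.up = nn.Linear(bottleneck_dim, hidden_dim, bias=False)

    def forward(self, hidden: torch.Tensor, w: torch.Tensor, b: torch.Tensor):
        z = self.down(hidden)                      # [batch, seq, bottleneck]
        z = torch.nn.functional.linear(z, w, b)    # generated parameters
        return hidden + self.up(z)                 # residual connection

# Usage sketch: one connector per gap between transformer layers;
# all dimensions below are placeholder values, not the paper's settings.
hidden_dim, bottleneck_dim, layer_emb_dim, num_layers = 768, 8, 16, 12
hyper = HyperNetwork(layer_emb_dim, bottleneck_dim)
connectors = nn.ModuleList(
    [InterLayerConnector(hidden_dim, bottleneck_dim) for _ in range(num_layers - 1)])
layer_embs = nn.Embedding(num_layers - 1, layer_emb_dim)

hidden = torch.randn(2, 16, hidden_dim)            # stand-in for a layer's output
for i, connector in enumerate(connectors):
    w, b = hyper(layer_embs(torch.tensor(i)))
    hidden = connector(hidden, w, b)               # frozen transformer layer would run between these
```

Under these assumptions, only the connectors, the hypernetwork, and the layer embeddings are trained, which is what keeps the per-task storage in the tens-to-hundreds of kilobytes range described in the abstract.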
Keywords