IEEE Access (Jan 2024)
Measuring and Improving the Energy Efficiency of Large Language Models Inference
Abstract
Recent improvements in the accuracy of machine learning (ML) models in the language domain have propelled their use in a multitude of products and services, touching millions of lives daily. These new levels of accuracy have been attained mainly through exponential growth in model size, creating a new category of models known as Large Language Models (LLMs) and leading to a substantial increase in computing and energy demands. While recent studies have focused on measuring and improving the energy consumption of LLMs during training, inference has received little attention. In this article, we present an approach to profile the energy consumption of LLMs during inference and leverage it to improve energy efficiency. For this, we deploy several state-of-the-art LLMs and observe how model size, number of layers, parallelized attention, and even vocabulary size affect their energy consumption. In addition, we leverage input batch size and different quantization levels to optimize their inference energy efficiency and latency.
Keywords