Measuring and Improving the Efficiency of Python Code Generated by LLMs Using CoT Prompting and Fine-Tuning

Ramya Jonnala; Jeong Yang; Young Lee; Gongbo Liang; Zechun Cao

doi:10.1109/ACCESS.2025.3585742

IEEE Access (Jan 2025)

Affiliations

Ramya Jonnala: Department of Computational, Engineering, and Mathematical Sciences, Texas A&M University-San Antonio, San Antonio, TX, USA
Jeong Yang: Department of Computational, Engineering, and Mathematical Sciences, Texas A&M University-San Antonio, San Antonio, TX, USA
Young Lee: Department of Computational, Engineering, and Mathematical Sciences, Texas A&M University-San Antonio, San Antonio, TX, USA
Gongbo Liang: Department of Computational, Engineering, and Mathematical Sciences, Texas A&M University-San Antonio, San Antonio, TX, USA
Zechun Cao: Department of Computational, Engineering, and Mathematical Sciences, Texas A&M University-San Antonio, San Antonio, TX, USA

DOI: https://doi.org/10.1109/ACCESS.2025.3585742
Journal volume & issue: Vol. 13
pp. 119657 – 119681

Abstract

The burgeoning sophistication of Artificial Intelligence (AI) has catalyzed the rapid proliferation of Large Language Models (LLMs) within software development. These models are increasingly employed to automate the generation of functionally correct code, address complex computational problems, and facilitate the debugging of existing software systems. However, LLM-generated code often faces challenges due to inherent inefficiencies, including redundant logical structures, factually inconsistent content (hallucinations), and programming errors. To address this issue, our research rigorously evaluated the computational efficiency of Python code generated by three prominent LLMs: GPT-4o-Mini, GPT-3.5-Turbo, and GPT-4-Turbo. The evaluation metrics encompass execution time, memory utilization, and peak memory consumption, while maintaining the functional correctness of the generated code. Leveraging the EffiBench benchmark datasets within the Google Vertex AI Workbench environment, across a spectrum of machine configurations, the study implemented a consistent seed parameter to ensure experimental reproducibility. Furthermore, we investigated the impact of two distinct optimization strategies: Chain-of-Thought (CoT) prompting and model fine-tuning. Our findings reveal a significant enhancement in efficiency metrics for GPT-4o-Mini and GPT-3.5-Turbo when employing CoT prompting; however, this trend was not observed for GPT-4-Turbo. Based on its promising performance with CoT prompting, we selected the GPT-4o-Mini model for subsequent fine-tuning, aiming to further enhance both its computational efficiency and accuracy. However, contrary to our expectations, fine-tuning the GPT-4o-Mini model led to a discernible degradation in both its accuracy and computational efficiency. 
In conclusion, this study provides empirical evidence suggesting that the deployment of high-CPU machine configurations, in synergy with the utilization of the GPT-4o-Mini model and CoT prompting techniques, yields demonstrably more efficient and accurate LLM-generated Python code, particularly within computationally intensive application scenarios.
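
As a rough illustration of the kind of measurement the abstract describes (execution time and peak memory, checked only after functional correctness is confirmed), the following minimal Python sketch compares two solutions to the same task. It is not the study's evaluation harness: the `measure` helper and both `two_sum_*` functions are hypothetical names introduced here, and `tracemalloc` is an assumed stand-in that tracks only memory allocated through Python, which may differ from the tooling used with EffiBench.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func once; return (result, elapsed_seconds, peak_bytes).

    Hypothetical helper: peak_bytes is the peak traced by tracemalloc,
    i.e. Python-level allocations only, not process-wide memory.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


def two_sum_quadratic(nums, target):
    """Check every pair: O(n^2) time, O(1) extra memory."""
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return (i, j)
    return None


def two_sum_linear(nums, target):
    """Single pass with a dict: O(n) time, O(n) extra memory."""
    seen = {}
    for i, x in enumerate(nums):
        need = target - x
        if need in seen:
            return (seen[need], i)
        seen[x] = i
    return None


if __name__ == "__main__":
    nums = list(range(500))
    target = 997  # only 498 + 499 sums to this value
    for solution in (two_sum_quadratic, two_sum_linear):
        result, elapsed, peak = measure(solution, nums, target)
        print(f"{solution.__name__}: result={result}, "
              f"time={elapsed:.6f}s, peak={peak} bytes")
```

Both functions return the same answer, so any difference in the reported numbers reflects efficiency rather than correctness. A single run is noisy; averaging over repeated runs on a fixed machine configuration would give steadier figures.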

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
