Zhihui kongzhi yu fangzhen (Aug 2025)
Code summarization based on large model knowledge distillation
Abstract
A code summary is a short natural-language description of source code. Summaries are usually only one sentence long, yet they are a primary way for developers to understand code. Recently, products based on large language models (such as ChatGPT) have demonstrated a strong ability to generate these descriptions. However, to use these tools, programmers must send their code to an untrusted third party for processing (for example, through API calls), which is unacceptable to many organizations. This paper presents an alternative: we use example outputs generated by GPT-3.5 to train an open-source model through a process related to knowledge distillation, enabling a small model (with 350 million parameters) to approach GPT-3.5's performance on the code summarization task.
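The distillation the abstract describes can be pictured as sequence-level (hard-label) knowledge distillation: the student is fine-tuned on (code, summary) pairs whose summaries were produced by the teacher, GPT-3.5. The sketch below is a minimal illustration under assumed choices; the student checkpoint (`Salesforce/codet5-base`), the hyperparameters, and the toy data are placeholders, not the paper's actual configuration.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
from datasets import Dataset

# Toy stand-in for a teacher-labeled corpus: each code snippet is paired
# with a summary that GPT-3.5 produced for it (hypothetical examples).
pairs = [
    {"code": "def add(a, b):\n    return a + b",
     "summary": "Adds two numbers and returns the result."},
    {"code": "def is_even(n):\n    return n % 2 == 0",
     "summary": "Checks whether a number is even."},
]

# Placeholder student checkpoint (~220M parameters); the paper's 350M
# model is not named in the abstract, so this is an assumption.
model_name = "Salesforce/codet5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(example):
    # Source sequence = code tokens; target = the teacher-written summary.
    enc = tokenizer(example["code"], truncation=True, max_length=512)
    enc["labels"] = tokenizer(example["summary"], truncation=True,
                              max_length=64)["input_ids"]
    return enc

dataset = Dataset.from_list(pairs).map(
    preprocess, remove_columns=["code", "summary"])

args = Seq2SeqTrainingArguments(
    output_dir="distilled-summarizer",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    learning_rate=5e-5,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

A practical consequence of this hard-label setup is that the student never needs the teacher's logits, only its generated text, which is exactly what makes distillation from a closed API model such as GPT-3.5 feasible.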
Keywords