IEEE Access (Jan 2021)

A Multi-Module Based Method for Generating Natural Language Descriptions of Code Fragments

  • Xuejian Gao,
  • Xue Jiang,
  • Qiong Wu,
  • Xiao Wang,
  • Lei Lyu,
  • Chen Lyu

DOI: https://doi.org/10.1109/ACCESS.2021.3055955
Journal volume & issue: Vol. 9, pp. 21579 – 21592

Abstract

Code fragment natural language description generation, also known as code summarization, refers to obtaining a natural language sequence describing a given code fragment's functionality. It is broadly agreed that applying code summarization in production can significantly improve the efficiency of software development and maintenance. In recent years, syntactic analysis (SA) and Latent Dirichlet Allocation (LDA) have been widely used in code summarization and have achieved good results. However, most existing techniques focus on core code statements, so the summaries they generate lack a logical description of the code fragment's holistic information. To this end, we propose a multi-module code summarization method that generates a natural language description for each code statement by constructing a new type of natural language template. Meanwhile, to utilize the code fragment's holistic information, we adopt code statement partition rules and a cosine similarity measure to rank and optimize the weights of the code fragment's overall information, and finally generate a holistic natural language description of the code fragment. The experimental results demonstrate that our method generates more concise and logical natural language descriptions than existing models.

Keywords