IEEE Access (Jan 2024)
A Generative AI-Driven Method-Level Semantic Clone Detection Based on the Structural and Semantical Comparison of Methods
Abstract
Code fragments with identical or similar functionality are called code clones. This study aims to detect semantic clones in Java programs at method-level granularity. To accomplish this, our approach extracts and compares both the semantic and the structural features of Java methods and applies a set of heuristics to obtain the final result set. The semantics of a method are often described by its documentation, while its structural details are characterized by the implementation in its body. However, Java methods are not always accompanied by descriptive documentation. To address this, we employ ChatGPT, a generative AI tool that can understand the given code and generate its documentation. To compare the documentation of methods, we use various corpus-based and knowledge-based information retrieval (IR) techniques. To compare the structural details of methods, we tokenize each method's body and apply an IR technique called VSM (Vector Space Modelling) to the tokenized body; we also extract and compare certain method-level metrics for this purpose. For the semantic comparison of methods, eight IR variants are formed based on the internal processing requirements of the different IR techniques. The proposed technique relies on textual and metrics-based analysis of Java programs with few parsing requirements, making it lightweight and less computation-intensive. To examine the effectiveness of the proposed technique, we validated it using a semantic clone benchmark. The results show that the proposed technique detects semantic clones with high recall, ranging from 67% to 81%, and with precision ranging from 60% to 96% across the IR variants explored in our research.
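The structural comparison described above (tokenize each method body, then compare the bodies in a vector space) can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not specify: a simple regex tokenizer, raw term-frequency weights, and cosine similarity as the comparison measure. The names `tokenize`, `tf_vector`, `cosine_similarity`, and `structural_similarity` are hypothetical and are not taken from the paper's implementation.

```python
import math
import re
from collections import Counter

# Assumed tokenizer: Java identifiers/keywords, numeric literals,
# and single operator or punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|\d+(?:\.\d+)?|[^\sA-Za-z0-9_$]")


def tokenize(body):
    """Split a Java method body into a flat list of lexical tokens."""
    return TOKEN_RE.findall(body)


def tf_vector(tokens):
    """Represent a token list as a sparse term-frequency vector."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty body shares nothing with any other body
    return dot / (norm_a * norm_b)


def structural_similarity(body_a, body_b):
    """Vector-space similarity of two method bodies (assumed TF weighting)."""
    return cosine_similarity(tf_vector(tokenize(body_a)),
                             tf_vector(tokenize(body_b)))


# Two illustrative method bodies that both sum an array.
sum_loop = "int s = 0; for (int i = 0; i < n; i++) { s += a[i]; } return s;"
sum_while = ("int total = 0; int k = 0; "
             "while (k < n) { total += a[k]; k++; } return total;")
print(round(structural_similarity(sum_loop, sum_while), 2))
```

The two bodies use different loop constructs and variable names yet still share many tokens, so the score falls strictly between 0 and 1. A TF-IDF weighting over the whole set of methods would be a natural refinement, but whether the paper uses it is not stated in the abstract.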
Keywords