Jisuanji kexue yu tansuo (Oct 2020)
Code Clone Detection Based on Program Vector Tree
Abstract
Code cloning facilitates software development but also causes recurring bugs and software quality problems. Some types of code clones have very low textual similarity, which makes them difficult to detect. To address this problem, this paper proposes a code clone detection method based on the program vector tree. First, feature representations of lexical units are extracted with a statistical language model, and the semantic similarities between different lexical units are analyzed. Second, the abstract syntax tree (AST) of each program is extracted by syntactic analysis, and each AST is transformed into a program vector tree in which each leaf node is assigned the feature representation of its corresponding lexical unit. Finally, a weighted encoding mechanism encodes each program vector tree into a fixed-size vector, taking into account the different weights of the nodes in the tree, and code fragments with similar vector representations are reported as code clones. Experimental results on BigCloneBench, an existing large benchmark of real code clones, show that this method outperforms many prominent clone detection methods, including NiCad, Deckard, SourcererCC, and Oreo, in detecting Moderately Type-3 or Type-4 clones, which have low textual similarity. These results verify the validity of the method.
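The pipeline summarized above (leaf feature vectors, a tree of vectors, a weighted bottom-up encoding, and a similarity comparison) can be illustrated with a minimal sketch. The abstract does not specify the weighting scheme, so the sketch assumes each child is weighted by its subtree size; the names `Node`, `EMB`, `leaf`, `encode`, and `cosine`, the two-dimensional toy vectors, and the use of cosine similarity are all hypothetical choices for illustration, not the authors' implementation.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Vector = List[float]

# Toy feature vectors for lexical units (assumed values; a real system
# would learn them from a statistical language model).
EMB: Dict[str, Vector] = {
    "sum": [1.0, 0.0], "total": [0.9, 0.1],
    "i": [0.0, 1.0], "j": [0.1, 0.9],
    "+": [0.5, 0.5], "print": [-1.0, 0.2],
}

@dataclass
class Node:
    label: str
    vector: Optional[Vector] = None          # set on leaf nodes only
    children: List["Node"] = field(default_factory=list)

def leaf(token: str) -> Node:
    """Build a leaf node carrying the feature vector of its token."""
    return Node(token, vector=list(EMB[token]))

def subtree_size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(subtree_size(c) for c in node.children)

def encode(node: Node) -> Vector:
    """Encode a tree bottom-up into one fixed-size vector.

    Assumed weighting: each child contributes in proportion to its
    subtree size, normalized over all children of the same parent.
    """
    if not node.children:
        return list(node.vector)
    sizes = [subtree_size(c) for c in node.children]
    total = sum(sizes)
    encoded = [encode(c) for c in node.children]
    out = [0.0] * len(encoded[0])
    for size, vec in zip(sizes, encoded):
        for k, x in enumerate(vec):
            out[k] += (size / total) * x
    return out

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Two fragments with different identifiers but similar meaning,
# and one unrelated fragment.
t1 = Node("BinOp", children=[leaf("sum"), leaf("+"), leaf("i")])
t2 = Node("BinOp", children=[leaf("total"), leaf("+"),
                             Node("Paren", children=[leaf("j")])])
t3 = Node("Call", children=[leaf("print"), leaf("i")])

sim_clone = cosine(encode(t1), encode(t2))
sim_other = cosine(encode(t1), encode(t3))
```

In this toy setting, `sim_clone` comes out much higher than `sim_other` even though `t1` and `t2` share no identifiers, which is the intuition behind detecting clones with low textual similarity. A detector built this way would report a pair as a clone when the similarity exceeds a chosen threshold; that threshold is a further assumption not given in the abstract.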
Keywords