IEEE Access (Jan 2020)

Enhancing Notation-Based Code Cloning Method With an External-Based Identifier Model

  • Ngoc-Tu Chau,
  • Souhwan Jung

DOI
https://doi.org/10.1109/ACCESS.2020.3016943
Journal volume & issue
Vol. 8
pp. 162989 – 162998

Abstract

Code clone detection is known for solving the code paradigm problem in software development. Malware analysts also apply this technique to determine whether a set of malware applications originated from the same malware family, based on the similarity of their source code. Existing notation-based approaches rely on standard identifier notations and generate signatures from the notation output. Specifically, for a set of words (or lexemes), the analyst applies rules to determine the type of each lexeme and labels it with a token type. So far, none of the existing code clone detection models considers collecting identifiers from an external source. In this paper, we propose a novel External-based Identifier Model for code clone detection. The proposed model assumes the existence of an external body of source code that can serve as a supervisor for identifying and labeling a specific set of lexemes. By introducing external identifiers into source code detection, our model can distinguish between multiple code fragments that share the same sequence of standard tokens. One case study for our model is Android analysis, where the Android Open Source Project can be used as the external source. An experiment on one million lines of Android source code shows that the proposed solution reduces the number of cases in which multiple code fragments map to a single signature, compared with the traditional method. Furthermore, an experiment on code suggestion shows that our model reduces the number of suggestion steps and thus produces output faster than a notation-based approach.
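The labeling idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer, the token type names (`KW`, `ID`, `NUM`, `STR`, `EXT:`), and the small `EXTERNAL_IDENTIFIERS` set standing in for names collected from an external source such as the Android Open Source Project are all assumptions made for illustration. The sketch shows two fragments that share one standard signature but receive different signatures once external identifiers are labeled separately.

```python
import re

# Hypothetical stand-in for identifiers collected from an external source
# (for example AOSP class and method names); not the paper's actual list.
EXTERNAL_IDENTIFIERS = {"SmsManager", "getDefault", "sendTextMessage"}

# Assumed keyword list, kept tiny for the example.
KEYWORDS = {"new", "return", "if", "else", "for", "while"}

# Lexemes: identifiers, integers, double-quoted strings, or any single symbol.
LEXEME_RE = re.compile(r'[A-Za-z_]\w*|\d+|"[^"]*"|\S')


def label(lexeme, external=frozenset()):
    """Map one lexeme to a token type; external identifiers keep their name."""
    if lexeme in KEYWORDS:
        return "KW"
    if lexeme[0].isalpha() or lexeme[0] == "_":
        return "EXT:" + lexeme if lexeme in external else "ID"
    if lexeme[0].isdigit():
        return "NUM"
    if lexeme[0] == '"':
        return "STR"
    return lexeme  # operators and punctuation stand for themselves


def signature(code, external=frozenset()):
    """Build a signature as the space-joined sequence of token types."""
    return " ".join(label(lx, external) for lx in LEXEME_RE.findall(code))


fragment_a = "x = SmsManager.getDefault();"
fragment_b = "y = Helper.getInstance();"

# Standard notation: both fragments collapse to the same signature.
standard_a = signature(fragment_a)
standard_b = signature(fragment_b)

# External-based labeling: the fragment using external names stays distinct.
external_a = signature(fragment_a, EXTERNAL_IDENTIFIERS)
external_b = signature(fragment_b, EXTERNAL_IDENTIFIERS)
```

Here `standard_a` and `standard_b` are both `ID = ID . ID ( ) ;`, while `external_a` becomes `ID = EXT:SmsManager . EXT:getDefault ( ) ;` and `external_b` is unchanged, so the two fragments no longer map to a single signature.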

Keywords