Applied Sciences (Oct 2020)

A Source Code Similarity Based on Siamese Neural Network

  • Chunli Xie,
  • Xia Wang,
  • Cheng Qian,
  • Mengqi Wang

DOI
https://doi.org/10.3390/app10217519
Journal volume & issue
Vol. 10, no. 21
p. 7519

Abstract

Finding similar code snippets is a fundamental task in software engineering. Several approaches to this task use statistical language models, which focus on the syntax and structure of code rather than on the deeper semantic information underlying it. In this paper, a Siamese Neural Network is proposed that maps code into continuous vector space and tries to capture its semantic meaning. First, an unsupervised pre-training method models each code snippet as a weighted series of word vectors, with the weights fitted by Term Frequency-Inverse Document Frequency (TF-IDF). Then, a Siamese Neural Network is trained to learn semantic vector representations of code snippets. Finally, cosine similarity is used to compute the similarity score between pairs of code snippets. We have implemented our approach and evaluated it on a dataset of functionally similar code. The experimental results show that our method yields some performance improvement over a method that uses word embeddings alone.
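
The first and last steps described above (TF-IDF-weighted word vectors and cosine similarity) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy two-dimensional embeddings, the three-snippet corpus, and the smoothed IDF formula are all assumptions made for illustration, and the Siamese network that the paper trains between these two steps is omitted entirely.

```python
import math
from collections import Counter


def tfidf_weights(tokens, corpus):
    """Return {token: tf * idf} for one snippet (hypothetical helper).

    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, which is an
    assumption; the paper's exact weighting may differ.
    """
    n_docs = len(corpus)
    weights = {}
    for tok, count in Counter(tokens).items():
        df = sum(1 for doc in corpus if tok in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[tok] = (count / len(tokens)) * idf
    return weights


def snippet_vector(tokens, corpus, embeddings):
    """TF-IDF-weighted average of the word vectors of a snippet's tokens."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    total = 0.0
    for tok, w in tfidf_weights(tokens, corpus).items():
        if tok not in embeddings:  # skip out-of-vocabulary tokens
            continue
        for i, x in enumerate(embeddings[tok]):
            vec[i] += w * x
        total += w
    return [v / total for v in vec] if total else vec


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy data (assumed): pre-tokenized snippets and hand-made 2-D embeddings.
corpus = [
    ["for", "i", "print"],
    ["while", "i", "print"],
    ["return", "i"],
]
embeddings = {
    "for": [1.0, 0.0],
    "while": [0.9, 0.1],
    "print": [0.0, 1.0],
    "i": [0.5, 0.5],
    "return": [0.1, 0.9],
}

vectors = [snippet_vector(doc, corpus, embeddings) for doc in corpus]
sim_loops = cosine_similarity(vectors[0], vectors[1])  # two loop snippets
sim_mixed = cosine_similarity(vectors[0], vectors[2])  # loop vs. return
```

With these toy values, the two loop-like snippets score higher than the loop and return pair. In the approach the abstract describes, the weighted vectors would be passed through the trained Siamese network before the cosine score is computed.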

Keywords