IEEE Access (Jan 2024)

CODE-SMASH: Source-Code Vulnerability Detection Using Siamese and Multi-Level Neural Architecture

  • Sungmin Han,
  • Hyunkyung Nam,
  • Jaesik Kang,
  • Kwangsoo Kim,
  • Seungjae Cho,
  • Sangkyun Lee

DOI
https://doi.org/10.1109/ACCESS.2024.3432323
Journal volume & issue
Vol. 12
pp. 102492 – 102504

Abstract

Software development evolves rapidly under competitive pressure and the continuous integration of new features, and this pace frequently leads to inadvertent security oversights. Traditional security practices are often reactive and focus primarily on identifying known vulnerabilities, which leaves a significant gap in detecting emergent, zero-day threats. This paper introduces CODE-SMASH, a deep learning-based source-code vulnerability detector that uses a Siamese neural network with a hierarchical architecture integrating BiGRU and attention mechanisms. Experiments on two real-world datasets, Chromium and Debian, show that CODE-SMASH outperforms existing methods. It improves detection performance on all key metrics (accuracy, precision, recall, and F1-score), with average improvements of approximately 8.3%, 11.6%, 27.75%, and 17.7%, respectively, over the best-performing existing methods in our experiments. CODE-SMASH also handles complex and lengthy code sequences better: for long code (60 to 80 lines), its F1 score exceeds that of the second-best model by 4.53 percentage points on the Chromium dataset and 5.62 percentage points on the Debian dataset. We believe our research makes a significant contribution to the field of automated vulnerability detection by providing a high-precision solution to the growing challenges in software security. Based on our findings, we anticipate that future research could enhance CODE-SMASH by extending it to other programming languages and by reducing its computational demands to improve efficiency.
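
For readers less familiar with the metrics named in the abstract, the following is a minimal sketch of how accuracy, precision, recall, and F1-score are conventionally computed from confusion-matrix counts, and of how a gain in percentage points differs from a relative gain. It is not taken from the paper: the function names are hypothetical, and the example assumes a binary vulnerable/non-vulnerable labelling.

```python
# Illustrative sketch only; the names below are hypothetical and do not
# come from the CODE-SMASH paper or its code.

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Return (precision, recall, F1) for the positive (vulnerable) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def percentage_point_gain(new: float, old: float) -> float:
    """Absolute difference between two scores in [0, 1], in percentage points."""
    return (new - old) * 100


def relative_gain(new: float, old: float) -> float:
    """Improvement of `new` over `old`, expressed as a percentage of `old`."""
    return (new - old) / old * 100
```

With made-up scores of 0.80 and 0.85, `percentage_point_gain(0.85, 0.80)` and `relative_gain(0.85, 0.80)` give different numbers, so it matters which of the two a reported improvement refers to.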

Keywords