CODE-SMASH: Source-Code Vulnerability Detection Using Siamese and Multi-Level Neural Architecture

Sungmin Han; Hyunkyung Nam; Jaesik Kang; Kwangsoo Kim; Seungjae Cho; Sangkyun Lee

doi:10.1109/ACCESS.2024.3432323

IEEE Access (Jan 2024)

Affiliations

Sungmin Han: School of Cybersecurity, Korea University, Seoul, Republic of Korea
Hyunkyung Nam: School of Cybersecurity, Korea University, Seoul, Republic of Korea
Jaesik Kang: Cyber Warfare Research and Development Laboratory, LIG Nex1, Seongnam-si, Republic of Korea
Kwangsoo Kim: Cyber Warfare Research and Development Laboratory, LIG Nex1, Seongnam-si, Republic of Korea
Seungjae Cho: Cyber Warfare Research and Development Laboratory, LIG Nex1, Seongnam-si, Republic of Korea
Sangkyun Lee: School of Cybersecurity, Korea University, Seoul, Republic of Korea

DOI: https://doi.org/10.1109/ACCESS.2024.3432323
Journal volume & issue: Vol. 12, pp. 102492–102504

Abstract

The rapid evolution of software development, propelled by competitive demands and the continuous integration of new features, frequently leads to inadvertent security oversights. Traditional security practices, often reactive in nature, primarily focus on identifying known vulnerabilities, creating a significant shortfall in detecting emergent, zero-day threats. This paper introduces CODE-SMASH, a novel deep learning-based source code vulnerability detector that utilizes a Siamese neural network with a hierarchical architecture integrating BiGRU and attention mechanisms. Our experiments using real-world datasets, specifically the Chromium and Debian datasets, demonstrate CODE-SMASH’s superiority over existing methods. It achieves significant improvements in detection performance across all key metrics, including accuracy, precision, recall, and F1-score, with average improvements of approximately 8.3%, 11.6%, 27.75%, and 17.7%, respectively, compared to the best-performing existing methods in our experiments. Moreover, CODE-SMASH shows its superior capability in handling complex and lengthy code sequences, with performance improvements for long-length code (60 to 80 lines) in F1 scores of 4.53 percentage points on the Chromium dataset and 5.62 percentage points on the Debian dataset compared to the second-best model’s performance. We believe our research makes a significant contribution to the field of automated vulnerability detection by providing a high-precision solution to the growing challenges in software security. Furthermore, based on our findings, we anticipate that future research could enhance CODE-SMASH by expanding its generalizability to various programming languages and reducing computational demands to improve efficiency.

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
