DB-CBIL: A DistilBert-Based Transformer Hybrid Model Using CNN and BiLSTM for Software Vulnerability Detection

Ahmed Bahaa; Aya Kamal; Hanan Fahmy; Amr S. Ghoneim

doi:10.1109/ACCESS.2024.3396410

IEEE Access (Jan 2024)

Affiliations

Ahmed Bahaa: Department of Information Systems, Faculty of Computers and Artificial Intelligence, Helwan University, Helwan, Egypt
Aya Kamal: Department of Information Systems, Faculty of Computers and Artificial Intelligence, Helwan University, Helwan, Egypt
Hanan Fahmy: Department of Information Systems, Faculty of Computers and Artificial Intelligence, Helwan University, Helwan, Egypt
Amr S. Ghoneim: Department of Computer Science, Faculty of Computers and Artificial Intelligence, Helwan University, Helwan, Egypt

DOI: https://doi.org/10.1109/ACCESS.2024.3396410
Journal volume & issue: Vol. 12
pp. 64446 – 64460

Abstract

Software vulnerabilities are among the significant causes of security breaches. If exploited by malicious attacks, vulnerabilities can severely compromise software security and may result in catastrophic losses. Hence, automatic vulnerability detection methods promise to mitigate attack risks and safeguard software security. This paper introduces DB-CBIL, a novel model for automatic detection of source code vulnerabilities that uses a hybrid deep learning model based on Distilled Bidirectional Encoder Representations from Transformers (DistilBERT). The proposed model uses the language model to produce contextualized word embeddings that capture the syntax and semantics of source code functions, based on their Abstract Syntax Tree (AST) representation. The model includes two main phases. First, the pre-trained DistilBERT transformer is fine-tuned on a vulnerable code dataset for word embedding. Second, a hybrid deep learning model detects which code functions are vulnerable. The hybrid model is built on two Deep Neural Networks (DNNs). The first is a Convolutional Neural Network (CNN), which is used for extracting features. The second is a Bidirectional LSTM (BiLSTM), which is used to maintain the sequential order of the data, as it can handle lengthy token sequences. The source code dataset is derived from the Software Assurance Reference Database (SARD) benchmark dataset. Final experimental findings show that the proposed model outperforms state-of-the-art approaches, improving precision, recall, and F1-score by 2.41%-8.95%, 4.0%-16.28%, and 1.85%-12.74%, respectively, and False Negative Rate (FNR) by 18%. The proposed model reports the lowest FNR in the literature, a significant achievement given the cost-based nature of vulnerability detectors.
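The four metrics named in the abstract can be illustrated with a minimal sketch. It uses the standard confusion-matrix definitions, with label 1 meaning "vulnerable" and 0 meaning "not vulnerable". The names `Confusion`, `confusion`, and `detection_metrics` are hypothetical helpers introduced here for illustration, and the toy labels are made up. Nothing below reproduces the paper's evaluation protocol or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for a binary vulnerability detector (hypothetical helper)."""
    tp: int  # vulnerable functions flagged as vulnerable
    fp: int  # non-vulnerable functions flagged as vulnerable
    tn: int  # non-vulnerable functions correctly passed
    fn: int  # vulnerable functions the detector missed


def confusion(y_true, y_pred):
    """Count TP/FP/TN/FN from parallel sequences of 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:  # truth == 1 and pred == 0
            fn += 1
    return Confusion(tp, fp, tn, fn)


def _safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def detection_metrics(c):
    """Standard definitions; FNR is the share of vulnerable functions missed."""
    precision = _safe_div(c.tp, c.tp + c.fp)
    recall = _safe_div(c.tp, c.tp + c.fn)
    f1 = _safe_div(2 * precision * recall, precision + recall)
    fnr = _safe_div(c.fn, c.fn + c.tp)  # equals 1 - recall when positives exist
    return {"precision": precision, "recall": recall, "f1": f1, "fnr": fnr}


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
    print(detection_metrics(confusion(y_true, y_pred)))
```

Because FNR is the complement of recall, lowering FNR means fewer vulnerable functions slip past the detector, which is why the abstract highlights it separately.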

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
