IEEE Access (Jan 2024)

DB-CBIL: A DistilBert-Based Transformer Hybrid Model Using CNN and BiLSTM for Software Vulnerability Detection

  • Ahmed Bahaa,
  • Aya Kamal,
  • Hanan Fahmy,
  • Amr S. Ghoneim

DOI
https://doi.org/10.1109/ACCESS.2024.3396410
Journal volume & issue
Vol. 12
pp. 64446 – 64460

Abstract

Software vulnerabilities are a significant cause of security breaches. If exploited by malicious attacks, they can severely compromise software security and may result in catastrophic losses. Automatic vulnerability detection methods therefore promise to mitigate attack risks and safeguard software security. This paper introduces DB-CBIL, a novel hybrid deep learning model for the automatic detection of source code vulnerabilities, based on Distilled Bidirectional Encoder Representations from Transformers (DistilBERT). The proposed model uses the language model to produce contextualized word embeddings that capture the syntax and semantics of source code functions, based on their Abstract Syntax Tree (AST) representation. The model has two main phases. First, the pre-trained DistilBERT transformer is fine-tuned on a vulnerable code dataset for word embedding. Second, a hybrid deep learning model detects which code functions are vulnerable. The hybrid model is built on two Deep Neural Networks (DNNs). The first is a Convolutional Neural Network (CNN), which extracts features. The second is a Bidirectional LSTM (BiLSTM), which preserves the sequential order of the data and can handle lengthy token sequences. The source code dataset is derived from the Software Assurance Reference Dataset (SARD) benchmark. Experimental findings show that the proposed model outperforms state-of-the-art approaches, improving precision, recall, F1-score, and False Negative Rate (FNR) by 2.41%-8.95%, 4.0%-16.28%, 1.85%-12.74%, and 18%, respectively. The proposed model reports the lowest FNR in the literature, a significant achievement given the cost-based nature of vulnerability detectors, where a missed vulnerability is typically more costly than a false alarm.
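
The four reported metrics can be made concrete with a short sketch. The Python below is illustrative only and is not taken from the paper: the function names `confusion_counts` and `detection_metrics` are hypothetical, and it assumes binary labels where 1 marks a vulnerable function and 0 a non-vulnerable one. It uses the standard definitions, under which FNR equals 1 minus recall.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), assuming label 1 = vulnerable and 0 = not vulnerable."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:  # truth == 1 and pred == 0: a missed vulnerability
            fn += 1
    return tp, fp, tn, fn


def detection_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute precision, recall, F1-score, and False Negative Rate (FNR)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    # Guard against empty denominators by reporting 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # FNR is the share of truly vulnerable functions the detector missed.
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fnr": fnr}


if __name__ == "__main__":
    # Made-up labels for illustration, not data from the paper.
    truth = [1, 1, 1, 1, 0, 0, 0, 0]
    predicted = [1, 1, 1, 0, 1, 0, 0, 0]
    for name, value in detection_metrics(truth, predicted).items():
        print(f"{name}: {value:.2f}")
```

Because each false negative is a vulnerable function that ships undetected, a detector can look strong on precision yet still be costly in practice, which is why FNR is reported separately from the other three metrics.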

Keywords