There Are Infinite Ways to Formulate Code: How to Mitigate the Resulting Problems for Better Software Vulnerability Detection

Jinghua Groppe; Sven Groppe; Daniel Senf; Ralf Möller

doi:10.3390/info15040216

Information (Apr 2024)

Affiliations

Jinghua Groppe: Institute of Information Systems (IFIS), University of Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
Sven Groppe: Institute of Information Systems (IFIS), University of Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany
Daniel Senf: Lufthansa Industry Solutions AS GmbH, Schützenwall 1, 22844 Norderstedt, Germany
Ralf Möller: Institute of Information Systems (IFIS), University of Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany

DOI: https://doi.org/10.3390/info15040216
Journal volume & issue: Vol. 15, no. 4, p. 216

Abstract

Given a set of software programs, each labeled as either vulnerable or benign, deep learning can be used to build a software vulnerability detector automatically. A challenge in this context is that there are countless equivalent ways to implement a particular functionality in a program. For instance, the naming of variables is often a matter of a programmer's personal style, which makes vulnerability patterns in programs harder to detect. Current deep learning approaches to software vulnerability detection operate on the raw text of a program and rely on general natural language processing capabilities to cope with the different naming schemes that occur in instances of vulnerability patterns. Learning to recover variable reference structures from raw text is often too high a burden, however. As a result, deep learning approaches still struggle to produce detectors that generalize well, owing to the naming problem or, more generally, the vocabulary explosion problem. In this work, we propose techniques that mitigate this problem by making the referential structure of variable references explicit in the input representations for deep learning approaches. Evaluation results show that deep learning models based on the techniques presented in this article outperform raw-text approaches for vulnerability detection. The new techniques also have a very small main-memory footprint: our experiments indicate that memory usage can be reduced by up to four orders of magnitude compared to existing methods.
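
To make the naming problem concrete, the following is a minimal sketch of one way referential structure could be made explicit: each identifier is replaced by a placeholder numbered by order of first appearance. This is an illustration only and is not taken from the article; the function name `normalize_identifiers`, the `VAR_n` placeholder scheme, the small keyword set, and the regex tokenizer are all assumptions, and the tokenizer does not handle string literals or comments.

```python
import re

# Hypothetical, deliberately small keyword set for a C-like language.
KEYWORDS = {"int", "char", "void", "return", "if", "else", "for", "while", "sizeof"}

# Crude tokenizer: identifiers, integer literals, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize_identifiers(code, known_functions=frozenset()):
    """Replace each user-chosen identifier with VAR_<n>, numbered by first use.

    Two occurrences of the same name map to the same placeholder, so the
    reference structure survives while the concrete names disappear.
    """
    mapping = {}
    tokens = []
    for tok in TOKEN_RE.findall(code):
        is_identifier = re.match(r"[A-Za-z_]", tok) is not None
        if is_identifier and tok not in KEYWORDS and tok not in known_functions:
            if tok not in mapping:
                mapping[tok] = f"VAR_{len(mapping)}"
            tokens.append(mapping[tok])
        else:
            tokens.append(tok)
    return tokens


# Two snippets that differ only in the programmer's choice of names.
snippet_a = "char buf[8]; strcpy(buf, src);"
snippet_b = "char dest[8]; strcpy(dest, input);"

norm_a = normalize_identifiers(snippet_a, known_functions={"strcpy"})
norm_b = normalize_identifiers(snippet_b, known_functions={"strcpy"})
assert norm_a == norm_b  # identical after normalization
```

Under these assumptions, both snippets yield the same token sequence, so a model sees one pattern instead of two, and the vocabulary is bounded by the number of distinct variables per program rather than by every name programmers might invent.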

Published in Information

ISSN: 2078-2489 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://www.mdpi.com/journal/information/
