Malware Detection by Control-Flow Graph Level Representation Learning With Graph Isomorphism Network

Yun Gao; Hirokazu Hasegawa; Yukiko Yamaguchi; Hajime Shimada

doi:10.1109/ACCESS.2022.3215267

IEEE Access (Jan 2022)

Affiliations

Yun Gao: Graduate School of Informatics, Nagoya University, Nagoya, Japan
Hirokazu Hasegawa: Center for Strategic Cyber Resilience Research and Development, National Institute of Informatics, Tokyo, Japan
Yukiko Yamaguchi: Information Technology Center, Nagoya University, Nagoya, Japan
Hajime Shimada: Information Technology Center, Nagoya University, Nagoya, Japan

DOI: https://doi.org/10.1109/ACCESS.2022.3215267
Journal volume & issue: Vol. 10
pp. 111830 – 111841

Abstract

With society’s increasing reliance on computer systems and network technology, the threat of malicious software is growing more serious. In information security, malware detection is a key problem that both academia and industry are committed to solving. Machine learning methods, such as the Gradient Boosting Decision Tree (GBDT) and deep neural networks, are effective for processing large-scale data. Although these detection methods can deal with cyber threats, most feature extraction methods rely on statistical features of portable executable (PE) files and thus do not capture the decompiled code or the execution flow structure of the PE samples. Therefore, we propose a malware classification system based on the Control-Flow Graph (CFG) and the Graph Isomorphism Network (GIN). The feature vectors of CFG basic blocks are generated using the large-scale pre-trained language model MiniLM; the GIN then further learns and compresses the CFG-based representation, which is classified with a multi-layer perceptron. In addition, we evaluated the effectiveness of the representation under different dimensions and classifiers. To evaluate our method, we built a CFG-based malware detection graph dataset from the PE files of the Blue Hexagon Open Dataset for Malware Analysis (BODMAS), which we call the Malware Geometric Binary Dataset (MGD-BINARY), and collected experimental results for the CFG representation under different dimension and classifier settings. The evaluation results show that our proposal achieves an accuracy of 0.99160 and an Area Under the Curve (AUC) of 0.99148.
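
The pipeline described in the abstract (basic-block vectors, GIN aggregation, graph-level representation) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes toy 2-dimensional vectors in place of MiniLM embeddings, a single linear layer with ReLU in place of the GIN's multi-layer perceptron, aggregation over predecessor blocks, and a sum readout. All function names, weights, and the example graph are hypothetical.

```python
# Toy sketch of one GIN layer plus a sum readout, in plain Python.
# All names and numbers are illustrative assumptions, not the paper's code.

def relu(vec):
    return [max(0.0, x) for x in vec]

def linear(vec, weights, bias):
    # weights is a list of rows; each row produces one output coordinate.
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def gin_layer(features, predecessors, weights, bias, eps=0.0):
    """GIN update: h_v' = f((1 + eps) * h_v + sum of h_u over neighbors u).

    Here f is a single linear layer followed by ReLU (a simplification),
    and the neighbors of v are taken to be its predecessor blocks.
    """
    updated = {}
    for node, h in features.items():
        agg = [(1.0 + eps) * x for x in h]
        for nb in predecessors.get(node, []):
            agg = [a + x for a, x in zip(agg, features[nb])]
        updated[node] = relu(linear(agg, weights, bias))
    return updated

def sum_readout(features):
    # Graph-level vector: coordinate-wise sum over all node vectors.
    dim = len(next(iter(features.values())))
    return [sum(h[i] for h in features.values()) for i in range(dim)]

# Hypothetical CFG with three basic blocks: 0 -> 1, 0 -> 2, 1 -> 2.
block_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
predecessors = {0: [], 1: [0], 2: [0, 1]}
identity = [[1.0, 0.0], [0.0, 1.0]]  # toy weights, chosen for readability
zero_bias = [0.0, 0.0]

hidden = gin_layer(block_vectors, predecessors, identity, zero_bias)
graph_vector = sum_readout(hidden)
```

In a full system, `graph_vector` would be passed to a classifier such as a multi-layer perceptron. Several GIN layers are usually stacked so that each block's vector reflects a larger neighborhood of the graph.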

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
