IEEE Access (Jan 2022)
Malware Detection by Control-Flow Graph Level Representation Learning With Graph Isomorphism Network
Abstract
With society’s increasing reliance on computer systems and network technology, the threat of malicious software grows more and more serious. In the field of information security, malware detection has been a key problem that academia and industry are committed to solving. Machine learning is an effective method for processing large-scale data, such as the Gradient Boosting Decision Tree (GBDT) and deep neural network technology. Although these types of detection methods can deal with cyber threats, most feature extraction methods are based on the statistical information features of portable executable (PE) files and thus lack the decompiled code and execution flow structure of the PE samples. Therefore, we propose a Control-Flow Graph (CFG)- and Graph Isomorphism Network (GIN)-based malware classification system. The feature vectors of CFG basic blocks are generated using the large-scale pre-trained language model MiniLM, which is beneficial for the GIN to further learn and compress the CFG-based representation, and classified with multi-layer perceptron. In addition, we evaluated the effectiveness of the representation under different dimensions and classifiers. To evaluate our method, we set up a CFG-based malware detection graph dataset from a PE file of the Blue Hexagon Open Dataset for Malware Analysis (BODMAS), which we call the Malware Geometric Binary Dataset (MGD-BINARY) and collected the experimental results of CFG representation in different dimensions and classifier settings. The evaluation results show that our proposal has proved an Accuracy metric of 0.99160 and achieved 0.99148 Area Under the Curve (AUC) results.
Keywords