IEEE Access (Jan 2020)
Comparative Analysis of Low-Dimensional Features and Tree-Based Ensembles for Malware Detection Systems
Abstract
Advances in machine learning algorithms have improved the performance of malware detection systems for the last decade. However, there are still some challenges such as processing a large amount of malware, learning high-dimensional vectors, high storage usage, and low scalability in learning. This paper proposes low-dimensional but effective features for a malware detection system and analyzes them with tree-base ensemble models. Expert knowledge and frequency analysis are adapted for relevant feature selection from the collected data set, which contributes to fast low-dimensional feature preparation, low storage usage, and fast learning. We extract the five types of malware features represented from binary or disassembly files. Specifically, the novel WEM (Window Entropy Map) image is designed to represent malware with variable length, and the set of frequently used APIs is analyzed to shorten the processing time. To validate the effectiveness of the selected features, we compare the performance of tree-based ensemble models such as AdaBoost, XGBoost, random forest, extra trees, and rotation trees. The proposed feature can reduce the original feature dimensionality by several tens to hundreds of times and decrease the training time of ensemble models without degrading the malware detection rate when compared to the performance of the whole set of malware features. In accuracy and AUC-PRC evaluation, XGBoost is the highest in rank.
Keywords