Comparative Analysis of Low-Dimensional Features and Tree-Based Ensembles for Malware Detection Systems

Seoungyul Euh; Hyunjong Lee; Donghoon Kim; Doosung Hwang

doi:10.1109/ACCESS.2020.2986014

IEEE Access (Jan 2020)

Affiliations

Seoungyul Euh: Security Technology Institute, KSign, Seoul, South Korea
Hyunjong Lee: Security Technology Institute, KSign, Seoul, South Korea
Donghoon Kim: Department of Computer Science, Arkansas State University, Jonesboro, AR, USA
Doosung Hwang: Department of Software Science, Dankook University, Yongin, South Korea

DOI: https://doi.org/10.1109/ACCESS.2020.2986014
Journal volume & issue: Vol. 8
pp. 76796 – 76808

Abstract

Advances in machine learning algorithms have improved the performance of malware detection systems over the last decade. However, challenges remain, such as processing a large amount of malware, learning from high-dimensional vectors, high storage usage, and low scalability in learning. This paper proposes low-dimensional but effective features for a malware detection system and analyzes them with tree-based ensemble models. Expert knowledge and frequency analysis are used to select relevant features from the collected data set, which contributes to fast preparation of low-dimensional features, low storage usage, and fast learning. We extract five types of malware features from binary or disassembly files. Specifically, the novel WEM (Window Entropy Map) image is designed to represent malware of variable length, and the set of frequently used APIs is analyzed to shorten the processing time. To validate the effectiveness of the selected features, we compare the performance of tree-based ensemble models: AdaBoost, XGBoost, random forest, extra trees, and rotation trees. Compared with the whole set of malware features, the proposed features reduce the original feature dimensionality by a factor of several tens to hundreds and decrease the training time of the ensemble models without degrading the malware detection rate. In the accuracy and AUC-PRC evaluations, XGBoost ranks highest.
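The abstract does not spell out how a WEM is constructed, so the following is only a minimal illustrative sketch of the general idea of a window-based entropy representation: compute the Shannon entropy of fixed-size byte windows, then average those values into a fixed number of buckets so that files of different lengths yield vectors of equal length. The function names, the 256-byte window, and the 16-value output length are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    return -sum((n / total) * math.log2(n / total) for n in Counter(chunk).values())


def window_entropies(data: bytes, window: int = 256) -> List[float]:
    """Entropy of each consecutive, non-overlapping window of `data`."""
    return [shannon_entropy(data[i:i + window]) for i in range(0, len(data), window)]


def fixed_length_profile(data: bytes, size: int = 16, window: int = 256) -> List[float]:
    """Average per-window entropies into `size` buckets (hypothetical helper).

    Inputs of any length map to a vector with exactly `size` entries.
    """
    values = window_entropies(data, window)
    if not values:
        return [0.0] * size
    profile = []
    for b in range(size):
        start = b * len(values) // size
        # Guarantee at least one value per bucket when there are few windows.
        end = max(start + 1, (b + 1) * len(values) // size)
        bucket = values[start:end]
        profile.append(sum(bucket) / len(bucket))
    return profile
```

With this sketch, a 300-byte input and a 30,000-byte input both produce a 16-value vector, which is the property that makes such a representation usable as a fixed-width input to tree-based models.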

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639