IEEE Access (Jan 2021)

Histogram Entropy Representation and Prototype Based Machine Learning Approach for Malware Family Classification

  • Byunghyun Baek,
  • Seoungyul Euh,
  • Dongheon Baek,
  • Donghoon Kim,
  • Doosung Hwang

DOI
https://doi.org/10.1109/ACCESS.2021.3127195
Journal volume & issue
Vol. 9
pp. 152098 – 152114

Abstract

The volume of malware has steadily increased as propagation and evasion techniques have advanced. Machine learning has made malware analysis more efficient by detecting a variety of behavioral and evasion patterns. However, on large-scale malware datasets, analysis with learning models incurs high time and space complexity. To address these problems, this work proposes a low-dimensional feature based on histogram entropy and a prototype selection algorithm based on hyperrectangles. The low-dimensional feature forms an $L \times 256$ map, where $L$ is a preselected parameter. The prototype selection algorithm divides the input space into overlapping subspaces, each defined by a hyperrectangle that becomes a prototype for its class. A set cover optimization algorithm then selects a small number of prototypes, which form a new training dataset used to classify malware families. The experiments compare the performance of machine learning models on the histogram entropy feature using both the BIG 2015 dataset and a collected dataset. The integrated approach is evaluated with learning algorithms such as Decision Tree, Random Forest, XGBoost, and CNN. The results indicate that models trained on the selected prototypes perform competitively with models trained on the entire dataset, while the proposed selection approach benefits from a smaller training set and lower time complexity.
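The abstract does not spell out how the $L \times 256$ map or the hyperrectangles are built, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes that a binary is split into $L$ equal chunks and that each row stores the per-byte-value entropy terms $-p \log_2 p$ of one chunk. It also assumes that each training point spans an axis-aligned box whose half-width is the Chebyshev distance to the nearest point of another class, and that a greedy heuristic stands in for the set cover optimization. The names `histogram_entropy_map` and `select_prototypes` are hypothetical.

```python
import numpy as np


def histogram_entropy_map(data: bytes, L: int) -> np.ndarray:
    """Return an L x 256 map (assumed construction, see lead-in).

    Row i holds the entropy terms -p * log2(p) of the byte-value
    histogram of the i-th chunk of `data`.
    """
    arr = np.frombuffer(data, dtype=np.uint8)
    feature = np.zeros((L, 256), dtype=np.float64)
    for i, chunk in enumerate(np.array_split(arr, L)):
        if chunk.size == 0:
            continue  # input shorter than L bytes: leave the row at zero
        p = np.bincount(chunk, minlength=256) / chunk.size
        nz = p > 0  # skip empty bins so log2 is never applied to zero
        feature[i, nz] = -p[nz] * np.log2(p[nz])
    return feature


def select_prototypes(X, y) -> list:
    """Greedy set cover over class-pure hyperrectangles (assumed variant).

    Returns the indices of the training points chosen as prototypes.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Chebyshev distances: an L-infinity ball is an axis-aligned box.
    d = np.abs(X[:, None, :] - X[None, :, :]).max(axis=2)
    other_class = y[:, None] != y[None, :]
    # Half-width of box i: distance to the nearest point of another class.
    radius = np.where(other_class, d, np.inf).min(axis=1)
    # covers[i, j] is True when box i contains same-class point j.
    covers = (d < radius[:, None]) & ~other_class
    np.fill_diagonal(covers, True)  # every point covers at least itself
    uncovered = np.ones(len(X), dtype=bool)
    chosen = []
    while uncovered.any():
        gain = (covers & uncovered).sum(axis=1)
        best = int(gain.argmax())  # box covering the most uncovered points
        chosen.append(best)
        uncovered &= ~covers[best]
    return chosen
```

Under these assumptions, a sample would be converted with `histogram_entropy_map(raw_bytes, L)` and flattened, after which `select_prototypes` would reduce the training set before it is passed to a classifier. Greedy selection is a standard approximation for set cover and is used here only for brevity.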

Keywords