Intelligent Systems with Applications (Nov 2023)

MalHyStack: A hybrid stacked ensemble learning framework with feature engineering schemes for obfuscated malware analysis

  • Kowshik Sankar Roy,
  • Tanim Ahmed,
  • Pritom Biswas Udas,
  • Md. Ebtidaul Karim,
  • Sourav Majumdar

Journal volume & issue
Vol. 20
p. 200283

Abstract

Since its advent, malware has taken a heavy toll on a world that exchanges billions of pieces of data daily. Millions of people fall victim to it, and the numbers are not decreasing as the years go by. Among the various types of malware, obfuscated malware is a special kind. Detecting it is necessary because it usually evades detection and is prevalent in the real world. Although numerous works have already addressed this field, most of them still fall short in some respects, considering the scope for exploration opened up by recent extensions. In addition, hybrid classification models have yet to become popular in this field. Thus, this paper proposes a novel hybrid classification model, named MalHyStack, for detecting such obfuscated malware within the network. The proposed model is built on a stacked ensemble learning scheme: conventional machine learning algorithms, namely the Extremely Randomized Trees (ExtraTrees) classifier, the Extreme Gradient Boosting (XGBoost) classifier, and Random Forest, are used in the first layer, which is followed by a deep learning layer in the second stage. Before the classification model is applied to malware detection, an optimum subset of features is selected using Pearson correlation analysis, which improved the accuracy of the model by more than 2% for multiclass classification. It also reduced time complexity by a factor of approximately two for binary classification and approximately three for multiclass classification. To evaluate the performance of the proposed model, a recently published balanced dataset named CIC-MalMem-2022 was used. On this dataset, the overall experimental results show that the proposed model outperforms the existing classification models.
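
The abstract does not say how the Pearson correlation analysis was turned into a feature subset, so the following is only a minimal sketch of one common approach, not the authors' procedure. It assumes that a feature is dropped when its absolute Pearson correlation with an already kept feature reaches a cutoff. The function names, the 0.9 threshold, and the toy feature columns are all hypothetical and chosen for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation; treat it as 0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, threshold=0.9):
    """Greedily keep features that are not strongly correlated with kept ones.

    `columns` maps a feature name to its list of values. The threshold is an
    assumed cutoff, not a value taken from the paper.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data (hypothetical): feat_b is an exact multiple of feat_a.
columns = {
    "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feat_b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "feat_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = select_features(columns)
print(selected)
```

In this toy example feat_b is redundant with feat_a and is removed, while feat_c is only weakly correlated with feat_a and is kept. Fewer input features is the usual reason such a filter shortens training and inference time, which is consistent with the speedups the abstract reports.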

Keywords