IEEE Access (Jan 2020)
An Improved Method of Detecting Macro Malware on an Imbalanced Dataset
Abstract
In spear-phishing attacks, macro malware written in VBA (Visual Basic for Applications) is often used to compromise target computers. Macro malware is frequently obfuscated in several ways to evade detection. To detect new macro malware, several methods based on machine learning techniques have been proposed. However, many of these methods were evaluated on inadequate datasets or on balanced datasets containing equal numbers of benign and malicious samples, so their practical performance remains open to discussion. In reality, the population of VBA macros consists of a wide variety of samples. To evaluate practical performance, an imbalanced dataset that contains many benign samples is required. In this paper, we propose an improved method of detecting macro malware on an imbalanced dataset. Our method uses two language models, Doc2vec and Latent Semantic Indexing (LSI), and four popular classifiers. The language models are used to extract features and to mitigate the class imbalance problem by selecting important features. We create an imbalanced dataset with more than 30,000 samples and evaluate practical performance on it. The experimental results demonstrate that our method mitigates the class imbalance problem and can detect completely new malware regardless of its family. The results also reveal that LSI is more robust than Doc2vec to the class imbalance problem.
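To illustrate why evaluation on an imbalanced dataset matters, the following minimal sketch compares a trivial "always benign" predictor with a hypothetical detector on a set dominated by benign samples. The sample counts, the helper names (`confusion`, `metrics`), and the choice of precision, recall, and F1 as metrics are assumptions made for illustration only; they are not taken from the paper's dataset or results.

```python
# Sketch: accuracy alone can look excellent on an imbalanced dataset.
# Labels: 1 = malicious, 0 = benign. All counts below are hypothetical.

def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) with 'malicious' as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical imbalanced test set: 990 benign and 10 malicious samples.
y_true = [0] * 990 + [1] * 10

# Predictor A: labels everything benign.
always_benign = [0] * 1000

# Predictor B: a hypothetical detector with 5 false alarms that
# catches 8 of the 10 malicious samples.
detector = [1] * 5 + [0] * 985 + [1] * 8 + [0] * 2

baseline = metrics(y_true, always_benign)
result = metrics(y_true, detector)

for name, m in (("always benign", baseline), ("detector", result)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

In this toy setting the two predictors differ by less than half a percentage point in accuracy, yet only the second one detects any malware, which is why per-class metrics are the more informative measure when benign samples dominate.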
Keywords