ICT Express (Mar 2022)
Compact feature hashing for machine learning based malware detection
Abstract
Machine learning can detect variant malware files that can evade signature-based detection. Feature hashing is used to convert features into a fixed-length vector. In this paper, we study the appropriate vector size for feature hashing for a large dataset of malware files. Through exhaustive experiments on more than 280,000 real malware and benign files, we find for the first time that the default vector size of current feature hashing practices is unnecessarily large. We experimentally explore the appropriate vector size, which not only reduces memory space by 70% but also increases the detection accuracy, compared with the state-of-the-art scheme.