Data in Brief (Dec 2023)
Android malware detection with MH-100K: An innovative dataset for advanced research
Abstract
High-quality datasets are crucial for building realistic and high-performance supervised malware detection models. Currently, one of the major challenges of machine learning-based solutions is the scarcity of datasets that are both representative and of high quality. To foster future research and provide updated and public data for comprehensive evaluation and comparison of existing classifiers, we introduce the MH-100K dataset [1], an extensive collection of Android malware information comprising 101,975 samples. It encompasses a main CSV file with valuable metadata, including the SHA256 hash (APK's signature), file name, package name, Android's official compilation API, 166 permissions, 24,417 API calls, and 250 intents. Moreover, the MH-100K dataset features an extensive collection of files containing useful metadata of the VirusTotal1 analysis. This repository of information can serve future research by enabling the analysis of antivirus scan result patterns to discern the prevalence and behaviour of various malware families. Such analysis can help to extend existing malware taxonomies, the identification of novel variants, and the exploration of malware evolution over time.