Android malware dataset construction methodology to minimize bias–variance tradeoff

Shinho Lee; Wookhyun Jung; Wonrak Lee; Hyung Geun Oh; Eui Tak Kim

ICT Express (Sep 2022)

Affiliations

Shinho Lee: Data Intelligence Lab, ESTsecurity, Seoul, Republic of Korea
Wookhyun Jung: Data Intelligence Lab, ESTsecurity, Seoul, Republic of Korea
Wonrak Lee: Data Intelligence Lab, ESTsecurity, Seoul, Republic of Korea
Hyung Geun Oh: National Security Research Institute, Daejeon, Republic of Korea
Eui Tak Kim: Data Intelligence Lab, ESTsecurity, Seoul, Republic of Korea; Corresponding author.

Journal volume & issue: Vol. 8, no. 3
pp. 444 – 462

Abstract

Research on Android malware categorization and detection increasingly focuses on proposing learned models that combine various features of Android apps with machine learning algorithms. For such modeling, constructing a dataset properly is no less important than selecting a suitable algorithm. The present study examines dataset construction using Dexofuzzy and proposes methods to determine the degree of bias and variance in the process and to minimize noise in sample-set labeling, where even identical samples may be labeled differently. The proposed method goes beyond existing dataset construction methods, which rely on label data provided by antivirus vendors, and offers an effective approach to constructing new types of datasets built on unified labels combined with opcode morphology. From the newly constructed datasets, a flexible dataset that allows overfitting and underfitting to be taken into account was obtained via N-Gram and M-Partial Matching. This flexible dataset was then clustered, and the resulting clustering performance was evaluated.
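
The abstract names N-Gram and M-Partial Matching without spelling out the procedure. The following is a minimal illustrative sketch in Python, not the authors' implementation. It assumes that each sample is represented by a signature string (for example, a fuzzy hash over its opcode sequence), that the string is tokenized into overlapping N-character tokens, and that two samples are linked when they share at least M tokens. The function names `ngrams`, `m_partial_match`, and `cluster`, and the toy signatures in the usage note, are hypothetical.

```python
def ngrams(signature: str, n: int) -> set[str]:
    """Return the set of contiguous n-character tokens in a signature string."""
    return {signature[i:i + n] for i in range(len(signature) - n + 1)}


def m_partial_match(sig_a: str, sig_b: str, n: int, m: int) -> bool:
    """True when the two signatures share at least m distinct n-gram tokens.

    Assumption: this is one plausible reading of "M-Partial Matching".
    """
    return len(ngrams(sig_a, n) & ngrams(sig_b, n)) >= m


def cluster(signatures: dict[str, str], n: int, m: int) -> list[set[str]]:
    """Group sample names that are linked, directly or transitively,
    by m_partial_match. Uses a small union-find over sample names."""
    parent = {name: name for name in signatures}

    def find(x: str) -> str:
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    names = list(signatures)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if m_partial_match(signatures[a], signatures[b], n, m):
                parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())
```

With toy signatures such as `{"a": "abcdefgh", "b": "abcdefxy", "c": "zzzzqqqq"}`, calling `cluster(sigs, n=4, m=2)` groups `a` with `b` and leaves `c` alone, while `m=4` separates all three. Under these assumptions, raising N or M tightens the match and yields smaller, stricter clusters, and lowering them loosens it, which is one way to picture the adjustable dataset the abstract describes.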

Published in ICT Express

ISSN: 2405-9595 (Online)
Publisher: Elsevier
Country of publisher: Netherlands
LCC subjects: Technology: Technology (General): Industrial engineering. Management engineering: Information technology
Website: http://www.journals.elsevier.com/ict-express/
