FAM: Featuring Android Malware for Deep Learning-Based Familial Analysis

Younghoon Ban; Sunjun Lee; Dokyung Song; Haehyun Cho; Jeong Hyun Yi

doi:10.1109/ACCESS.2022.3151357

IEEE Access (Jan 2022)

Affiliations

Younghoon Ban: ORCiD; School of Software Convergence, Soongsil University, Seoul, South Korea
Sunjun Lee: School of Software, Soongsil University, Seoul, South Korea
Dokyung Song: ORCiD; Department of Computer Science, Yonsei University, Seoul, South Korea
Haehyun Cho: ORCiD; School of Software, Soongsil University, Seoul, South Korea
Jeong Hyun Yi: ORCiD; School of Software, Soongsil University, Seoul, South Korea

DOI: https://doi.org/10.1109/ACCESS.2022.3151357
Journal volume & issue: Vol. 10
pp. 20008 – 20018

Abstract

To handle relentlessly emerging Android malware, deep learning has been widely adopted in the research community. Prior work proposed deep learning-based approaches that use different features of malware and reported high accuracy in malware detection, i.e., distinguishing malware from benign applications. However, familial analysis of real-world Android malware has not yet been extensively studied. Familial analysis refers to the process of classifying a given malware sample into a family (or a set of families), which can greatly accelerate malware analysis because the family label conveys the sample's fine-grained behavioral characteristics. In this work, we shed light on deep learning-based familial analysis by studying different features of Android malware and how effectively they represent its (malicious) behaviors. We focus on string features of Android malware, namely the Abstract Syntax Trees (ASTs) of all functions extracted from each malware sample, which faithfully represent all string features of Android malware. We thoroughly study how different string features, such as how security-sensitive APIs are used in malware, affect the performance of our deep learning-based familial analysis model. A convolutional neural network was trained and tested in various configurations on a dataset of 28,179 real-world malware samples that appeared in the wild from 2018 to 2020, where each sample has one or more labels assigned based on its behaviors. Our evaluation reveals how different features contribute to the performance of familial analysis. Notably, with all features combined, we achieved an accuracy of up to 98% and a micro F1-score of 0.82, a result on par with the state of the art.
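Because each sample may carry more than one family label, the abstract reports both an accuracy and a micro F1-score, and the two can differ noticeably. The sketch below illustrates why on toy data. It assumes accuracy is counted per label decision, which the abstract does not specify. The function names, the five-family layout, and the label matrices are hypothetical and are not taken from the paper.

```python
# Toy multi-label example: rows are samples, columns are (hypothetical) families.
# 1 means the sample belongs to (or is predicted to belong to) that family.
Y_TRUE = [
    [1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]
Y_PRED = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # misses the second family of this sample
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],  # predicts one extra family
]


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all samples and labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def label_accuracy(y_true, y_pred):
    """Fraction of individual label decisions that are correct
    (an assumed definition; true negatives count as correct)."""
    total = correct = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0


if __name__ == "__main__":
    print(f"label accuracy: {label_accuracy(Y_TRUE, Y_PRED):.2f}")
    print(f"micro F1:       {micro_f1(Y_TRUE, Y_PRED):.2f}")
```

On this toy data the label accuracy is 0.90 while the micro F1 is 0.80: the many correctly rejected families (true negatives) raise accuracy but do not enter the F1 computation.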

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
