Gaining New Insight into Machine-Learning Datasets via Multiple Binary-Feature Frequency Ranks with a Mobile Benign/Malware Apps Example

Gürol Canbek

doi:10.17350/HJSE19030000221

Hittite Journal of Science and Engineering (Jun 2021)

Affiliations

Gürol Canbek: ASELSAN

DOI: https://doi.org/10.17350/HJSE19030000221
Journal volume & issue: Vol. 8, no. 2
pp. 103 – 121

Abstract

Researchers compare their Machine Learning (ML) classification performance with that of other studies without examining and comparing the datasets used for training, validation, and testing. One reason is that few convenient methods give initial insight into datasets beyond the descriptive statistics applied to individual continuous or quantitative features. After demonstrating initial manual analysis techniques, this study proposes a novel adaptation of the Kruskal-Wallis statistical test to compare a group of datasets over multiple prominent binary features, which are very common in today’s datasets. As an illustrative example, the new method was tested on six benign/malware mobile application datasets over the frequencies of prominent binary features to explore the dissimilarity of the datasets per class. The feature vector consists of over a hundred “application permission requests”, which are binary flags for the Android platform’s primary access control mechanism, intended to provide privacy and secure data and information on mobile devices. Permissions are also the leading transparent features for ML-based malware classification. The proposed data-analytical methodology can be applied in any domain through that domain’s prominent features of interest. The results, which are also visualized in three new ways, show that the proposed method gives the degree of dissimilarity among the datasets. Specifically, the conducted test shows that the frequencies in the aggregated dataset and in some of the datasets are not substantially different from each other, and that they are even in close agreement in the positive-class datasets. The proposed domain-independent method is expected to give researchers useful initial insight when comparing different datasets.
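As a rough illustration of the idea in the abstract, the sketch below computes the frequency of each binary feature per dataset and then applies a standard Kruskal-Wallis H test across the datasets. It is one plausible reading of the abstract, not the paper's actual adaptation. The function names, the three toy datasets, and their four binary columns are all hypothetical and serve only to make the example runnable.

```python
import math
from collections import Counter


def feature_frequencies(rows):
    """Fraction of samples in which each binary feature is set (column means)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def average_ranks(values):
    """1-based ranks; tied values receive the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


# Hypothetical toy data: rows are apps, columns are four binary permission flags.
dataset_a = [[1, 0, 1, 0], [1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
dataset_b = [[1, 0, 1, 0], [1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1]]
dataset_c = [[0, 0, 1, 1], [1, 0, 0, 1], [0, 0, 1, 1], [0, 1, 0, 1]]

# One sample per dataset: the frequencies of its binary features.
samples = [feature_frequencies(d) for d in (dataset_a, dataset_b, dataset_c)]
h = kruskal_wallis_h(samples)

# With three groups, H is compared against chi-square with 2 degrees of
# freedom, whose survival function is exactly exp(-x / 2).
p_value = math.exp(-h / 2)
print(f"H = {h:.3f}, approximate p = {p_value:.3f}")
```

A small H (large p-value) would suggest that the feature-frequency ranks of the datasets are not distinguishable, while a large H would suggest dissimilar datasets. For a different number of groups, the p-value would need the chi-square survival function with k - 1 degrees of freedom instead of the closed form used here.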

Published in Hittite Journal of Science and Engineering

ISSN: 2148-4171 (Online)
Publisher: Hitit University
Country of publisher: Türkiye
LCC subjects: Technology: Engineering (General). Civil engineering (General)
Website: https://dergipark.org.tr/en/pub/hjse
