Machine Learning and Knowledge Extraction (Aug 2020)
Hierarchy-Based File Fragment Classification
Abstract
File fragment classification is an essential problem in digital forensics. Although several attempts have been made to solve this challenging problem, a general solution has not been found. In this work, we propose a hierarchical machine-learning-based approach to file fragment classification that uses optimized support vector machines (SVMs) as the base classifiers. The approach places more general classifiers at the top level of the hierarchy and more specialized, fine-grained classifiers at the lower levels. We also propose a primitive taxonomy of file types that can be used to perform hierarchical classification. We evaluate our model on a dataset of 14 file types, with 1000 fragments of 512 bytes each per file type, drawn from a subset of the publicly available govdocs1 corpus from Digital Corpora. Our experiment shows results comparable to the present literature, with an average accuracy of 67.78% and an F1-measure of 65% using 10-fold cross-validation. We then refine the hierarchy and obtain better results, with an increase of 1% in the F1-measure. Finally, we present our assessment and observations, and conclude by discussing the scope of future research.
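As an illustration of the hierarchical idea only, the following minimal Python sketch routes a fragment through a general top-level classifier and then through a specialized classifier for the predicted group. It is not the implementation evaluated in this work: a nearest-centroid classifier stands in for the optimized SVMs, and the byte-histogram features, the two-group taxonomy, the synthetic byte-range "file types", and all class and function names are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

FRAGMENT_SIZE = 512  # fragment size used in the paper's dataset

# Hypothetical toy taxonomy: coarse group -> leaf classes.
# Each leaf is a synthetic "file type" defined by the byte range it draws from.
TAXONOMY = {"text": ["digits", "letters"], "binary": ["low", "high"]}
BYTE_RANGES = {"digits": (0x30, 0x40), "letters": (0x61, 0x71),
               "low": (0x00, 0x10), "high": (0xF0, 0x100)}


def make_fragment(leaf, rng):
    """Generate one synthetic 512-byte fragment for a toy leaf class."""
    lo, hi = BYTE_RANGES[leaf]
    return bytes(rng.randrange(lo, hi) for _ in range(FRAGMENT_SIZE))


def byte_histogram(fragment):
    """Normalized 256-bin byte-frequency vector (an assumed feature choice)."""
    counts = Counter(fragment)
    return [counts[b] / len(fragment) for b in range(256)]


class NearestCentroid:
    """Stand-in base classifier; the paper uses optimized SVMs instead."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for x, label in zip(X, y):
            groups[label].append(x)
        # One mean feature vector per label.
        self.centroids = {label: [sum(col) / len(rows) for col in zip(*rows)]
                          for label, rows in groups.items()}
        return self

    def predict_one(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(self.centroids, key=lambda lab: sq_dist(self.centroids[lab]))


class HierarchicalClassifier:
    """General classifier at the top, one specialized classifier per group."""

    def __init__(self, taxonomy, make_base=NearestCentroid):
        self.taxonomy = taxonomy
        self.make_base = make_base
        self.parent_of = {leaf: group
                          for group, leaves in taxonomy.items() for leaf in leaves}

    def fit(self, X, y):
        # Top level: predict the coarse group of each fragment.
        self.top = self.make_base().fit(X, [self.parent_of[lab] for lab in y])
        # Lower level: within each group, predict the leaf class.
        self.children = {}
        for group, leaves in self.taxonomy.items():
            rows = [(x, lab) for x, lab in zip(X, y) if lab in leaves]
            self.children[group] = self.make_base().fit(
                [x for x, _ in rows], [lab for _, lab in rows])
        return self

    def predict_one(self, x):
        group = self.top.predict_one(x)
        return self.children[group].predict_one(x)


rng = random.Random(0)
leaves = [leaf for group in TAXONOMY.values() for leaf in group]
train = [(byte_histogram(make_fragment(leaf, rng)), leaf)
         for leaf in leaves for _ in range(20)]
model = HierarchicalClassifier(TAXONOMY).fit([x for x, _ in train],
                                             [y for _, y in train])
predictions = {leaf: model.predict_one(byte_histogram(make_fragment(leaf, rng)))
               for leaf in leaves}
print(predictions)
```

Because the base classifier is passed in through `make_base`, any object with the same `fit` and `predict_one` methods, such as a wrapper around an SVM, could be substituted without changing the routing logic.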
Keywords