IEEE Access (Jan 2024)

PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis

  • G. M. Sakhawat Hossain,
  • Kaushik Deb,
  • Helge Janicke,
  • Iqbal H. Sarker

DOI
https://doi.org/10.1109/ACCESS.2024.3357620
Journal volume & issue
Vol. 12
pp. 13833 – 13859

Abstract

The Portable Document Format (PDF) is one of the most widely used file types, so attackers embed malicious code in PDF documents to compromise victims’ devices. Conventional detection techniques are often insufficient and may only partially prevent PDF malware, both because of the malware’s versatile nature and because of the techniques’ excessive dependence on a fixed, typical feature set. The primary goal of this work is to detect PDF malware efficiently in order to alleviate these difficulties. To accomplish this, we first develop a comprehensive dataset of 15,958 PDF samples that covers benign, malicious, and evasive behaviors. Using three well-known PDF analysis tools (PDFiD, PDFINFO, and PDF-PARSER), we extract significant features from the samples in our newly created dataset. In addition, we derive a number of features that have been experimentally shown to be helpful in classifying PDF malware. Through empirical analysis of the extracted and derived features, we develop a method to build an efficient and explainable feature set. We explore different baseline machine learning classifiers and demonstrate an accuracy improvement of approximately 2% for the Random Forest classifier when using the selected feature set. Furthermore, we demonstrate the model’s explainability by creating a decision tree that generates rules for human interpretation. Finally, we compare our results with previous studies and point out some important findings.
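
As a rough illustration of the keyword-style structural features that tools such as PDFiD report, the sketch below counts a few PDF name keywords in raw bytes and adds one derived flag. It is a minimal sketch under stated assumptions, not the extraction pipeline used in the paper: the keyword subset, the function name keyword_counts, the derived feature script_with_auto_action, and the sample bytes are all hypothetical and chosen only for illustration.

```python
import re

# Assumption: a small, illustrative subset of PDF name keywords.
# The feature set actually selected in the paper is not reproduced here.
KEYWORDS = [b"/JS", b"/JavaScript", b"/OpenAction", b"/AA", b"/Launch", b"/EmbeddedFile"]


def keyword_counts(data: bytes) -> dict:
    """Count occurrences of each keyword in the raw bytes of a PDF."""
    counts = {}
    for kw in KEYWORDS:
        # The negative lookahead stops a keyword from matching inside a
        # longer name that merely starts with it (e.g. /AA inside /AAB).
        pattern = re.escape(kw) + rb"(?![A-Za-z0-9])"
        counts[kw.decode()] = len(re.findall(pattern, data))
    # Hypothetical derived feature: script content combined with an
    # action that can fire automatically when the document is opened.
    has_script = counts["/JS"] > 0 or counts["/JavaScript"] > 0
    has_auto_action = counts["/OpenAction"] > 0 or counts["/AA"] > 0
    counts["script_with_auto_action"] = int(has_script and has_auto_action)
    return counts


# Hand-written fragment for demonstration only; it is not a complete PDF file.
sample = (
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
)
features = keyword_counts(sample)
print(features)
```

Counts like these could serve as input columns for a classifier such as the Random Forest mentioned above. Real samples often hide keywords through name obfuscation or compressed object streams, which a plain byte search like this one does not handle.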

Keywords