Digital Zone: Jurnal Teknologi Informasi dan Komunikasi (May 2024)

Comparative Study of the Effect of Datasets and Machine Learning Algorithms for PDF Malware Detection

  • Salman Wiharja,
  • Deden Pradeka,
  • Wirmanto Suteddy

DOI
https://doi.org/10.31849/digitalzone.v15i1.19744
Journal volume & issue
Vol. 15, no. 1
pp. 80 – 93

Abstract

This research presents a machine learning approach to detecting malicious PDFs, focusing on the expansion of the Evasive-PDFMal2022 dataset. The objective is to improve detection accuracy by enlarging the dataset and increasing its diversity, and to provide a practical tool, in the form of a website, for extracting features from PDFs and detecting malicious ones. The methodology updates and enlarges the dataset with additional malicious PDFs sourced from CVE and Exploit-db, along with non-malicious PDFs from diverse origins. Features are then extracted with the PDFiD tool, and the resulting 20 features are used to train K-Nearest Neighbor (KNN), Random Forest, and Random Committee classifiers. The results show that the model trained on the expanded dataset achieves 99% accuracy, outperforming models trained solely on the Evasive-PDFMal2022 dataset. The study also delivers the expanded, more diverse dataset and the website for feature extraction and malicious PDF detection.
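
The pipeline described in the abstract (keyword-count features followed by a classifier) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the keyword list is a small hypothetical subset rather than the 20 features used in the paper, the byte strings are toy stand-ins for real PDFs, and the counting is a simplified imitation of what PDFiD reports, not PDFiD itself. Only KNN is shown; Random Forest and Random Committee are not reproduced here.

```python
import math
import re
from collections import Counter

# Hypothetical subset of PDF name keywords; the abstract does not list
# the paper's actual 20 features.
NAMES = ["/Page", "/JS", "/JavaScript", "/OpenAction", "/AA", "/Launch", "/EmbeddedFile"]

# Match a name only when it is not followed by a letter or digit,
# so that "/Page" does not also count "/Pages".
PATTERNS = [re.compile(re.escape(n.encode()) + rb"(?![A-Za-z0-9])") for n in NAMES]


def extract_features(data: bytes) -> list:
    """Count occurrences of each keyword in the raw bytes of a file."""
    return [len(p.findall(data)) for p in PATTERNS]


def knn_predict(train_x, train_y, x, k=3):
    """Label x by majority vote among its k nearest training vectors (Euclidean)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy samples (not real PDFs), labeled by hand for illustration only.
SAMPLES = [
    (b"<< /Type /Page >> << /Type /Page >>", "benign"),
    (b"<< /Type /Pages /Kids [1 0 R] >> << /Type /Page >>", "benign"),
    (b"<< /OpenAction 1 0 R >> << /JS (x) /JavaScript >>", "malicious"),
    (b"<< /AA << >> /Launch << >> /JavaScript >>", "malicious"),
]
TRAIN_X = [extract_features(data) for data, _ in SAMPLES]
TRAIN_Y = [label for _, label in SAMPLES]

if __name__ == "__main__":
    query = b"<< /OpenAction 2 0 R >> << /JavaScript /JS (y) >> /Launch"
    print(knn_predict(TRAIN_X, TRAIN_Y, extract_features(query)))
```

In practice the feature vectors would come from running PDFiD over each file in the dataset, and the classifier would be evaluated on held-out data rather than on a handful of hand-written strings.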

Keywords