Comparative Study of the Effect of Datasets and Machine Learning Algorithms for PDF Malware Detection

Salman Wiharja; Deden Pradeka; Wirmanto Suteddy

doi:10.31849/digitalzone.v15i1.19744

Digital Zone: Jurnal Teknologi Informasi dan Komunikasi (May 2024)

Affiliations

Salman Wiharja: Universitas Pendidikan Indonesia
Deden Pradeka
Wirmanto Suteddy

DOI: https://doi.org/10.31849/digitalzone.v15i1.19744
Journal volume & issue: Vol. 15, no. 1, pp. 80–93

Abstract

This research presents an approach to detecting malicious PDFs with machine learning algorithms, centered on expanding the Evasive-PDFMal2022 dataset. The objective is to improve detection accuracy by enriching the dataset and increasing its representativeness and diversity, and to develop a practical tool, a website, for feature extraction and malicious PDF detection. The methodology updates and enlarges the dataset with additional malicious PDFs sourced from CVE and Exploit-db, along with non-malicious PDFs from diverse origins. Features are then extracted using the PDFiD tool, and the resulting 20 features are used to train K-Nearest Neighbor (KNN), Random Forest, and Random Committee models. The results show that the model trained on the expanded dataset achieves 99% accuracy, surpassing models trained solely on the Evasive-PDFMal2022 dataset. The research also improves the representation and diversity of the dataset and delivers a practical website for the extraction and detection of malicious PDFs.
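The pipeline described in the abstract (keyword-count features followed by a classifier) can be illustrated with a minimal sketch. This is not the authors' implementation and it does not call PDFiD: the keyword list is a hypothetical six-item subset rather than the paper's 20 features, the toy training vectors are invented for illustration, and KNN is written by hand so the example runs with the standard library only.

```python
import math
from collections import Counter

# Hypothetical subset of PDF name keywords. The paper's exact 20
# PDFiD features are not listed in the abstract, so this is an assumption.
KEYWORDS = [b"/JS", b"/JavaScript", b"/OpenAction", b"/AA", b"/Launch", b"/EmbeddedFile"]


def extract_features(pdf_bytes: bytes) -> list:
    """Count raw occurrences of each keyword in the file bytes.

    Simplified stand-in for a PDFiD-style scan: it does not handle
    obfuscated names or distinguish keywords inside streams.
    """
    return [pdf_bytes.count(keyword) for keyword in KEYWORDS]


def knn_predict(train_X: list, train_y: list, x: list, k: int = 3) -> str:
    """Return the majority label among the k nearest training vectors."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set, invented for illustration only.
train_X = [
    [0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0], [0, 0, 1, 0, 0, 0],  # benign
    [1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 1, 0], [2, 1, 1, 0, 0, 0],  # malicious
]
train_y = ["benign"] * 3 + ["malicious"] * 3

benign_sample = b"%PDF-1.4 1 0 obj << /Type /Page >> endobj"
malicious_sample = b"%PDF-1.4 1 0 obj << /OpenAction 2 0 R /JS (app.alert(1)) /JavaScript >> endobj"

for name, sample in [("benign_sample", benign_sample), ("malicious_sample", malicious_sample)]:
    features = extract_features(sample)
    print(name, features, knn_predict(train_X, train_y, features))
```

In practice the feature vectors would come from a real extraction tool run over the full dataset, and the classifier would be evaluated on held-out files rather than on hand-made examples like these.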

Published in Digital Zone: Jurnal Teknologi Informasi dan Komunikasi

ISSN: 2086-4884 (Print); 2477-3255 (Online)
Publisher: Universitas Lancang Kuning
Country of publisher: Indonesia
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering: Telecommunication; Technology: Electrical engineering. Electronics. Nuclear engineering: Electronics: Computer engineering. Computer hardware
Website: https://journal.unilak.ac.id/index.php/dz
