Cybersecurity attacks: Which dataset should be used to evaluate an intrusion detection system?

Danijela D. Protić; Miomir M. Stanković

doi:10.5937/vojtehg71-46524

Vojnotehnički Glasnik (Oct 2023)

Affiliations

Danijela D. Protić: Serbian Armed Forces, General Staff, Department for Telecommunication and Informatics, Center for Applied Mathematics and Electronics, Belgrade, Republic of Serbia
Miomir M. Stanković: Mathematical Institute of the Serbian Academy of Sciences and Arts, Belgrade, Republic of Serbia

DOI: https://doi.org/10.5937/vojtehg71-46524
Journal volume & issue: Vol. 71, no. 4
pp. 970 – 995

Abstract

Introduction: Analyzing the high-dimensional datasets used for intrusion detection is a challenge for researchers. This paper presents the most frequently used datasets. ADFA comprises two datasets of records from Linux/Unix systems. AWID is based on actual traces of normal and intrusion activity in an IEEE 802.11 Wi-Fi network. CAIDA collects various types of data in geographically and topologically diverse regions. CIC-IDS2017 covers the HTTP, HTTPS, FTP, SSH, and email protocols. CSE-CIC-IDS2018 includes abstract distribution models for applications, protocols, or lower-level network entities. DARPA contains network traffic data. The ISCX 2012 dataset has profiles of various multi-stage attacks and actual network traffic with background noise. KDD Cup '99 is a collection of data transfers from a virtual environment. Kyoto 2006+ contains records of real network traffic and is used only for anomaly detection. NSL-KDD corrects flaws in KDD Cup '99 caused by redundant and duplicate records. UNSW-NB15 is derived from real normal data and synthesized contemporary attack activities in network traffic.

Methods: This study uses both quantitative and qualitative techniques. It draws on scientific references and publicly accessible information about the given datasets.

Results: Datasets are often simulated to meet the objectives of a particular organization. The number of real datasets is very small compared with the number of simulated datasets. Anomaly detection is rarely used today.

Conclusion: The main characteristics of the datasets are presented, together with a comparative analysis in terms of creation date, size, number of features, traffic types, and purpose.

Published in Vojnotehnički Glasnik

ISSN: 0042-8469 (Print); 2217-4753 (Online)
Publisher: University of Defence in Belgrade
Country of publisher: Serbia
LCC subjects: Military Science; Technology: Engineering (General). Civil engineering (General)
Website: http://www.vtg.mod.gov.rs/index-e.html