Balancing the Scale: Data Augmentation Techniques for Improved Supervised Learning in Cyberattack Detection

Kateryna Medvedieva; Tommaso Tosi; Enrico Barbierato; Alice Gatti

doi:10.3390/eng5030114

Eng (Sep 2024)

Affiliations

Kateryna Medvedieva: Department of Mathematics and Physics, Catholic University of the Sacred Heart, 25121 Brescia, Italy
Tommaso Tosi: Department of Mathematics and Physics, Catholic University of the Sacred Heart, 25121 Brescia, Italy
Enrico Barbierato: Department of Mathematics and Physics, Catholic University of the Sacred Heart, 25121 Brescia, Italy
Alice Gatti: Department of Mathematics and Physics, Catholic University of the Sacred Heart, 25121 Brescia, Italy

DOI: https://doi.org/10.3390/eng5030114
Journal volume & issue: Vol. 5, no. 3
pp. 2170 – 2205

Abstract

The increasing sophistication of cyberattacks calls for advanced detection systems that can accurately identify and mitigate potential threats. This research addresses cyberattack detection in two stages. First, we constructed a realistic yet imbalanced dataset by simulating a range of cyberattack types, so that it reflects real-world network intrusion conditions and the class imbalance typical of genuine cybersecurity threats. Second, recognizing the limitations that imbalanced data imposes, we applied several data augmentation techniques, including SMOTE and ADASYN, to correct the skew in class distribution and provide a more balanced dataset for training supervised machine learning models. Our evaluation of these techniques across various models, such as Random Forests and Neural Networks, demonstrates significant improvements in detection capabilities. The analysis also investigates feature importance, providing insight into which attributes most strongly influence the models’ predictions. This enhances the interpretability of the models and helps refine feature engineering and selection to optimize performance.
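To make the oversampling idea concrete, the following is a minimal sketch of SMOTE-style interpolation in plain Python. It is not the authors' implementation: the function name `smote`, its parameters, and the toy `benign` and `attack` rows are all hypothetical and chosen only for illustration. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its k nearest minority neighbours.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic minority samples (SMOTE-style interpolation).

    Hypothetical helper for illustration; not the paper's code.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than peers
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours, excluding the base itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy imbalanced data with made-up values: 20 "benign" rows vs 4 "attack" rows.
benign = [[float(i), float(i % 5)] for i in range(20)]
attack = [[100.0, 1.0], [102.0, 2.0], [101.0, 3.0], [103.0, 1.5]]

# Oversample the minority class until both classes have the same size.
new_attack = smote(attack, n_new=len(benign) - len(attack), k=3)
balanced_attack = attack + new_attack
print(len(benign), len(balanced_attack))  # prints "20 20"
```

In practice one would more likely rely on a maintained library; for example, the imbalanced-learn package provides `SMOTE` and `ADASYN` classes with a `fit_resample` method. As a general precaution, oversampling is usually applied to the training split only, so that synthetic points do not leak into evaluation data.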

Published in Eng

ISSN: 2673-4117 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://www.mdpi.com/journal/eng
