Applied Sciences (May 2025)
Semi-Supervised Learning for Intrusion Detection in Large Computer Networks
Abstract
In an increasingly interconnected world, securing large networks against cyber-threats has become paramount as cyberattacks become more frequent and more difficult and expensive to remedy. This research explores data-driven security by applying semi-supervised machine learning techniques to intrusion detection in large-scale network environments. Novel methods, including a decision tree with entropy-based uncertainty sampling, logistic regression with self-training, and co-training with random forest, are proposed to perform intrusion detection with limited labeled data. These methods leverage both the available labeled data and the abundant unlabeled data. Extensive experiments on the CIC-DDoS2019 dataset show promising results: both the decision tree with entropy-based uncertainty sampling and the co-training with random forest models achieve 99% accuracy. Furthermore, the UNSW-NB15 dataset is used to conduct a comparative analysis between the base models (random forest, decision tree, and logistic regression) trained on labeled data only and the proposed models trained on partially labeled data. The proposed methods outperform the base models when using 1%, 10%, and 50% labeled data, highlighting their effectiveness and potential for improving intrusion detection systems in scenarios with limited labeled data.
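To illustrate the entropy-based uncertainty sampling named above, the following minimal sketch ranks unlabeled samples by the Shannon entropy of a classifier's predicted class probabilities and selects the most uncertain ones. It is an illustration under stated assumptions, not the implementation used in this work: the function names `entropy` and `most_uncertain`, the two-class probability values, and the selection size `k` are all hypothetical, and how the selected samples are subsequently labeled or used is not specified here.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of one predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def most_uncertain(prob_rows, k):
    """Return the indices of the k rows with the highest predictive entropy."""
    ranked = sorted(
        range(len(prob_rows)),
        key=lambda i: entropy(prob_rows[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical predicted probabilities (benign, attack) for four unlabeled flows.
unlabeled_probs = [
    [0.99, 0.01],  # confident prediction, low entropy
    [0.55, 0.45],  # near the decision boundary
    [0.80, 0.20],
    [0.50, 0.50],  # maximally uncertain for two classes
]

# Pick the two flows the model is least sure about.
selected = most_uncertain(unlabeled_probs, k=2)
print(selected)
```

In this sketch the probabilities would come from the current model (for example, a decision tree's class-probability output), and the selection step would be repeated after each retraining round.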
Keywords