IEEE Access (Jan 2023)
Validation of a Machine Learning-Based IDS Design Framework Using ORNL Datasets for Power System With SCADA
Abstract
Supervisory Control and Data Acquisition (SCADA) systems are widely used for remote monitoring and control of industrial processes, such as oil and gas production, power generation, transmission and distribution, and water treatment. Although recent advances in communication technologies have improved accessibility, control, and data availability, their use also exposes critical infrastructures such as power systems to potential cyber threats. A Machine Learning (ML)-based Intrusion Detection System (IDS) is a promising countermeasure; however, developing ML models often requires custom methodologies for data preprocessing and training. Such methodologies are necessary for creating high-performance models that can be robustly evaluated and seamlessly integrated into real-time systems. We therefore propose an ML-based IDS design framework for a SCADA-based power system that incorporates effective modeling aspects: dataset preprocessing to ensure accurate representation, data augmentation to achieve a balanced dataset, automated feature selection to reduce dimensionality, and rigorous model training and testing procedures. To substantiate the proposed design framework, we conducted a series of experiments using a publicly available Oak Ridge National Laboratory (ORNL) dataset for a SCADA-based power system. The models were validated on unseen data. The augmented dataset was built by aggregating readings from four Phasor Measurement Units (PMUs), collected over a specific time span, into a unified dataset. Among the assessed classifiers, the Random Forest (RF) model, trained on the augmented and balanced dataset, outperformed the others, yielding an F1 score of 94.09% during testing with unseen data.
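Two of the ideas mentioned above, class balancing and the F1 score, can be illustrated with a minimal sketch. The abstract does not say how the dataset was balanced or how F1 was averaged, so the random oversampling and the single-positive-class F1 used here are assumptions made for illustration, and the function names `oversample_to_balance` and `f1_score` are hypothetical rather than taken from the paper.

```python
import random
from collections import Counter


def oversample_to_balance(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class.

    Assumption: random oversampling stands in for whatever balancing
    method the paper actually uses.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw extra copies (with replacement) to reach the target size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (made up): three "normal" rows (label 0), one "attack" row (label 1).
rows = [[0.1], [0.2], [0.3], [0.9]]
labels = [0, 0, 0, 1]
balanced_rows, balanced_labels = oversample_to_balance(rows, labels)
print(Counter(balanced_labels))

# Toy predictions (made up): 2 true positives, 1 false positive, 1 false negative.
print(round(f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]), 4))
```

In a setup like this, balancing would be applied to the training split only, so that the unseen test data keeps its original class distribution when the F1 score is computed.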
Keywords