IEEE Access (Jan 2023)
Standard Latent Space Dimension for Network Intrusion Detection Systems Datasets
Abstract
Machine learning (ML) is a branch of artificial intelligence that gives computers the ability to create or improve algorithms by learning directly from data, without being explicitly programmed. It is widely used for automation and decision-making tasks in fields such as image and speech recognition, sentiment analysis, and self-driving cars. However, its application to communication networks is limited by the lack of appropriate research resources, such as rich training datasets or a standard set of features. In this context, a standard latent space dimension is proposed, obtained through an autoencoder-based dimensionality reduction process. Different network security datasets are projected onto a lower-dimensional space to determine a standard, or convergent, dimension. The convergent dimension is the threshold above which increasing the latent space dimension yields diminishing returns in the autoencoder loss. The experimental validation showed that four machine learning classification models trained with a standard latent space of ten dimensions performed as well as models trained on the non-reduced versions of the datasets in terms of F1-score and accuracy. Furthermore, a Wilcoxon statistical test showed that the mean accuracy of all classification models trained with the standard latent space dimension differed by less than 0.0235 from that of the models trained with the original inputs. This negligible difference in accuracy is a significant outcome because researchers can run experiments using only the latent space, confident that the performance of ML models will not be constrained.
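The diminishing-returns criterion described above can be illustrated with a minimal sketch. It assumes an autoencoder has already been trained once per candidate latent dimension and that the final reconstruction losses are available. The function name `convergent_dimension`, the relative tolerance `rel_tol`, and the loss values are hypothetical choices made for illustration; they are not the selection rule or the results reported in the paper.

```python
def convergent_dimension(losses, rel_tol=0.05):
    """Pick the smallest latent dimension beyond which the loss stops
    improving meaningfully.

    losses  : dict mapping latent dimension -> final autoencoder loss
    rel_tol : hypothetical threshold; a relative improvement below this
              value counts as a diminishing return
    """
    dims = sorted(losses)
    for d, d_next in zip(dims, dims[1:]):
        if losses[d] <= 0:
            # Loss cannot improve further; stop here.
            return d
        # Relative loss reduction gained by moving to the next dimension.
        improvement = (losses[d] - losses[d_next]) / losses[d]
        if improvement < rel_tol:
            return d
    # No plateau found within the tested range.
    return dims[-1]


# Made-up loss values for illustration only (not data from the paper).
example_losses = {
    2: 0.200, 4: 0.120, 6: 0.070, 8: 0.040,
    10: 0.020, 12: 0.0195, 14: 0.0192,
}
chosen = convergent_dimension(example_losses)
print(chosen)
```

With these illustrative values the loss keeps dropping sharply up to ten dimensions and flattens afterwards, so the sketch selects ten. In practice the same check would be repeated per dataset, and the dimension on which the datasets agree would be taken as the standard one.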
Keywords