IEEE Access (Jan 2021)
An Efficient Network Classification Based on Various-Widths Clustering and Semi-Supervised Stacking
Abstract
Network traffic classification is a basic tool for internet service providers and various government and private organizations to carry out investigations of network activities such as intrusion detection (IDS), security monitoring, lawful interception and Quality of Service (QoS). Recent network traffic classification approaches have used extracted and predefined class labels, obtained from multiple experts, to build robust network traffic classifiers. However, keeping IP traffic classifiers up to date requires large amounts of newly emerging labeled traffic flows, which is often expensive and time-consuming to obtain. This paper proposes an efficient network classification approach (named Net-Stack) that inherits the advantages of various-widths clustering and semi-supervised stacking to mitigate the shortage of labeled flows and to accurately learn IP traffic features. The Net-Stack approach consists of four stages. The first stage pre-processes the traffic data and removes noisy traffic observations using various-widths clustering to select the most representative observations from both local and global perspectives. The second stage generates multiview representations of the original data with strong discriminative ability using dimensionality reduction techniques. The third stage applies heterogeneous semi-supervised learning algorithms that exploit the complementary information contained in the multiple views to refine the decision boundaries for each traffic class and to produce a low-dimensional metadata representation. The final stage employs a meta-classifier and a stacking approach to learn comprehensively from the metadata representation obtained in the third stage, improving generalization performance and producing the final classification decision. An experimental study on twelve traffic data sets shows the effectiveness of the proposed Net-Stack approach compared to the baseline methods when relatively little labeled training data is available.
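For illustration only, the following is a minimal sketch of a semi-supervised stacking pipeline of the general kind described above, written with scikit-learn. It is not the authors' Net-Stack implementation: the PCA-based views, the SelfTrainingClassifier base learners and the logistic-regression meta-classifier are assumptions chosen for brevity, and unlabeled flows follow the scikit-learn convention of y = -1.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Synthetic stand-in for labeled/unlabeled traffic flows (placeholder data).
X, y = make_classification(n_samples=1000, n_features=40, n_informative=10,
                           n_classes=3, random_state=0)
y_semi = y.copy()
y_semi[rng.rand(len(y)) < 0.9] = -1          # keep only ~10% of the labels

# Sketch of the multiview stage: two low-dimensional views of the features.
views = [PCA(n_components=10, random_state=0).fit_transform(X),
         PCA(n_components=5, random_state=1).fit_transform(X)]

# Sketch of the semi-supervised stage: heterogeneous base learners, one per view,
# trained on labeled plus unlabeled flows via self-training.
base_learners = [SelfTrainingClassifier(SVC(probability=True, random_state=0)),
                 SelfTrainingClassifier(RandomForestClassifier(random_state=0))]
meta_features = []
for view, learner in zip(views, base_learners):
    learner.fit(view, y_semi)
    meta_features.append(learner.predict_proba(view))
Z = np.hstack(meta_features)                  # low-dimensional metadata representation

# Sketch of the stacking stage: a meta-classifier fit on the labeled metadata.
labeled = y_semi != -1
meta_clf = LogisticRegression(max_iter=1000).fit(Z[labeled], y[labeled])
print("meta-classifier accuracy on all flows:", meta_clf.score(Z, y))

The sketch mirrors only the overall flow (views, semi-supervised base learners, stacked meta-classifier); the paper's various-widths clustering for noise removal and its specific learners are described in the later sections.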
Keywords