How to Effectively Collect and Process Network Data for Intrusion Detection?

Mikołaj Komisarek; Marek Pawlicki; Rafał Kozik; Witold Hołubowicz; Michał Choraś

doi:10.3390/e23111532

Entropy (Nov 2021)

Affiliations

Mikołaj Komisarek: ITTI Sp. z o.o., Rubież 46, 61-612 Poznań, Poland
Marek Pawlicki: ITTI Sp. z o.o., Rubież 46, 61-612 Poznań, Poland
Rafał Kozik: ITTI Sp. z o.o., Rubież 46, 61-612 Poznań, Poland
Witold Hołubowicz: Institute of Telecommunications and Computer Science, Bydgoszcz University of Science and Technology, 85-796 Bydgoszcz, Poland
Michał Choraś: Faculty of Mathematics and Computer Science, FernUniversität in Hagen, Universitatsstrasse 11, 58097 Hagen, Germany

DOI: https://doi.org/10.3390/e23111532
Journal volume & issue: Vol. 23, no. 11, p. 1532

Abstract

The number of security breaches in cyberspace is on the rise, and the intrusion detection research community is responding with intensive work. To keep defensive mechanisms up to date and relevant, realistic network traffic datasets are needed. Using flow-based data for machine-learning-based network intrusion detection is a promising direction for intrusion detection systems. However, many contemporary benchmark datasets do not contain features that are usable in the wild. The main contribution of this work is to close the research gap in identifying and investigating valuable features in the NetFlow schema that allow for effective, machine-learning-based network intrusion detection in the real world. To achieve this goal, several feature selection techniques have been applied to five flow-based network intrusion detection datasets, establishing an informative flow-based feature set. The authors’ experience with deploying this kind of system shows that closing the research-to-market gap, and applying machine-learning-based intrusion detection in a real-world setting, requires collecting a set of labeled data from the end-user. This research therefore also aims to establish the minimal amount of data that is sufficient to effectively train machine learning algorithms for intrusion detection. The results show that a set of 10 features and a small amount of data is enough for the final model to perform very well.
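The abstract does not say which feature selection techniques were used, so the following is only a minimal sketch of the general idea: rank flow features by how much they reduce label entropy (information gain) and keep the top k. Everything in it is an assumption made for illustration: the synthetic data generator, the equal-width binning, the helper names, and the choice of information gain itself. The field names `IN_BYTES` and `IN_PKTS` follow NetFlow naming, while `NOISE_A` and `NOISE_B` are hypothetical uninformative columns.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, bins=4):
    """Reduction in label entropy after splitting `values` into equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # a constant feature falls into a single bin
    groups = {}
    for v, y in zip(values, labels):
        idx = min(int((v - lo) / width), bins - 1)
        groups.setdefault(idx, []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def top_k_features(rows, labels, k):
    """Rank features by information gain and return the names of the k best."""
    names = list(rows[0].keys())
    scores = {n: information_gain([r[n] for r in rows], labels) for n in names}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def make_synthetic_flows(n, seed=0):
    """Hypothetical labeled flows: two informative fields and two noise fields."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        attack = rng.random() < 0.3
        rows.append({
            "IN_BYTES": rng.gauss(200 if attack else 1500, 100),  # informative
            "IN_PKTS": rng.gauss(3 if attack else 12, 2),         # informative
            "NOISE_A": rng.random(),                              # uninformative
            "NOISE_B": rng.random(),                              # uninformative
        })
        labels.append(int(attack))
    return rows, labels


rows, labels = make_synthetic_flows(500)
selected = top_k_features(rows, labels, k=2)
print(selected)
```

On real NetFlow exports the same ranking could be computed on a small labeled sample from the end-user's network and the selected columns passed to whichever classifier is being trained; the bin count and k would need tuning per dataset.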

Published in Entropy

ISSN: 1099-4300 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Astronomy: Astrophysics; Science: Physics
Website: http://www.mdpi.com/journal/entropy
