Symmetry (Mar 2020)

Natural Language Processing Model for Automatic Analysis of Cybersecurity-Related Documents

  • Tiberiu-Marian Georgescu

DOI
https://doi.org/10.3390/sym12030354
Journal volume & issue
Vol. 12, no. 3
p. 354

Abstract

Read online

This paper describes the development and implementation of a natural language processing model based on machine learning which performs cognitive analysis for cybersecurity-related documents. A domain ontology was developed using a two-step approach: (1) the symmetry stage and (2) the machine adjustment. The first stage is based on the symmetry between the way humans represent a domain and the way machine learning solutions do. Therefore, the cybersecurity field was initially modeled based on the expertise of cybersecurity professionals. A dictionary of relevant entities was created; the entities were classified into 29 categories and later implemented as classes in a natural language processing model based on machine learning. After running successive performance tests, the ontology was remodeled from 29 to 18 classes. Using the ontology, a natural language processing model based on a supervised learning model was defined. We trained the model using sets of approximately 300,000 words. Remarkably, our model obtained an F1 score of 0.81 for named entity recognition and 0.58 for relation extraction, showing superior results compared to other similar models identified in the literature. Furthermore, in order to be easily used and tested, a web application that integrates our model as the core component was developed.

Keywords