IEEE Access (Jan 2022)
Predicting CVSS Metric via Description Interpretation
Abstract
Cybercrime affects companies worldwide, costing millions of dollars annually. The constant increase of threats and vulnerabilities raises the need to handle vulnerabilities in a prioritized manner. This prioritization can be achieved through Common Vulnerability Scoring System (CVSS), typically used to assign a score to a vulnerability. However, there is a temporal mismatch between the vulnerability finding and score assignment, which motivates the development of approaches to aid in this aspect. We explore the use of Natural Language Processing (NLP) models in CVSS score prediction given vulnerability descriptions. We start by creating a vulnerability dataset from the National Vulnerability Database (NVD). Then, we combine text pre-processing and vocabulary addition to improve the model accuracy and interpret its prediction reasoning by assessing word importance, via Shapley values. Experiments show that the combination of Lemmatization and 5,000-word addition is optimal for DistilBERT, the outperforming model in our experiments of the NLP methods, achieving state-of-the-art results. Furthermore, specific events (such as an attack on a known software) tend to influence model prediction, which may hinder CVSS prediction. Combining Lemmatization with vocabulary addition mitigates this effect, contributing to increased accuracy. Finally, binary classes benefit the most from pre-processing techniques, particularly when one class is much more prominent than the other. Our work demonstrates that DistilBERT is a state-of-the-art model for CVSS prediction, demonstrating the applicability of deep learning approaches to aid in vulnerability handling. The code and data are available at https://github.com/Joana-Cabral/CVSS_Prediction.
Keywords