Applied Sciences (Aug 2024)
BVTED: A Specialized Bilingual (Chinese–English) Dataset for Vulnerability Triple Extraction Tasks
Abstract
Extracting knowledge from cyber threat intelligence is essential for understanding cyber threats and implementing proactive defense measures. However, the Chinese cybersecurity field lacks open datasets that support both entity and relation extraction tasks. This paper addresses this gap by analyzing vulnerability description texts, which are standardized and knowledge-dense, to create a vulnerability knowledge ontology comprising 13 entity types and 15 relation types. We annotated 27,311 unique vulnerability description sentences from the China National Vulnerability Database, producing a dataset named BVTED for cybersecurity knowledge triple extraction tasks. BVTED contains 97,391 entities and 69,614 relations, with entities expressed in a mix of Chinese and English. To evaluate the dataset’s value, we trained five deep learning-based named entity recognition models, two relation extraction models, and two joint entity–relation extraction models on BVTED. Experimental results demonstrate that models trained on this dataset achieve excellent performance in vulnerability knowledge extraction tasks. By providing a comprehensive ontology and a new dataset, this work enhances the extraction of cybersecurity knowledge triples from mixed Chinese and English threat intelligence corpora and aids the mining, analysis, and utilization of the knowledge embedded in cyber threat intelligence.
Keywords