BVTED: A Specialized Bilingual (Chinese–English) Dataset for Vulnerability Triple Extraction Tasks

Kai Liu; Yi Wang; Zhaoyun Ding; Aiping Li; Weiming Zhang

doi:10.3390/app14167310

Applied Sciences (Aug 2024)

Affiliations

Kai Liu: National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha 410073, China
Yi Wang: National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha 410073, China
Zhaoyun Ding: National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha 410073, China
Aiping Li: School of Computer, National University of Defense Technology, Changsha 410073, China
Weiming Zhang: National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha 410073, China

DOI: https://doi.org/10.3390/app14167310
Journal volume & issue: Vol. 14, no. 16, p. 7310

Abstract

Extracting knowledge from cyber threat intelligence is essential for understanding cyber threats and implementing proactive defense measures. However, there is a lack of open datasets in the Chinese cybersecurity field that support both entity and relation extraction tasks. This paper addresses this gap by analyzing vulnerability description texts, which are standardized and knowledge-dense, to create a vulnerability knowledge ontology comprising 13 entities and 15 relations. We annotated 27,311 unique vulnerability description sentences from the China National Vulnerability Database, resulting in a dataset named BVTED for cybersecurity knowledge triple extraction tasks. BVTED contains 97,391 entities and 69,614 relations, with entities expressed in a mix of Chinese and English. To evaluate the dataset’s value, we trained five deep learning-based named entity recognition models, two relation extraction models, and two joint entity–relation extraction models on BVTED. Experimental results demonstrate that models trained on this dataset achieve excellent performance in vulnerability knowledge extraction tasks. This work enhances the extraction of cybersecurity knowledge triples from mixed Chinese and English threat intelligence corpora by providing a comprehensive ontology and a new dataset, significantly aiding in the mining, analysis and utilization of the knowledge embedded in cyber threat intelligence.

Published in Applied Sciences

ISSN: 2076-3417 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Engineering (General). Civil engineering (General); Science: Biology (General); Science: Physics; Science: Chemistry
Website: http://www.mdpi.com/journal/applsci
