Tongxin xuebao (Jun 2022)
Research on entity recognition and alignment of APT attack based on Bert and BiLSTM-CRF
Abstract
Objectives: In an increasingly complex and fast-changing network security environment, defending against Advanced Persistent Threat (APT) attacks has become an urgent problem for the entire security community. The large volume of APT attack analysis reports and threat intelligence produced by security companies has significant research value: it provides information about APT organizations and thereby supports the traceability analysis of network attack events. However, APT analysis reports have not been fully utilized, and automated methods for generating structured knowledge and constructing feature portraits of hacker organizations are lacking. To address this, an automatic knowledge extraction method for APT attacks that combines entity recognition and entity alignment is proposed. The method automatically extracts entities from APT analysis reports and constructs structured knowledge about each APT organization. Methods: An automatic extraction method for APT attack knowledge that integrates entity recognition and entity alignment is designed. First, 12 entity categories are defined according to the characteristics of APT attacks. The preprocessing layer then performs lowercase conversion, data cleaning, and data annotation on the corpus, and the preprocessed APT text sequence is represented as a vector. Second, a Bert model is built to pre-train on the annotated corpus, encode each word, and generate the corresponding word vector; a BiLSTM model is constructed to capture long-distance and contextual semantic features; and an attention mechanism is added to highlight key features and convert the vector sequence into an annotation probability matrix. Third, the CRF algorithm is used to decode the dependencies between the predicted output labels and generate the optimal label sequence.
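The CRF decoding step can be illustrated with a minimal Viterbi sketch. This is not the authors' implementation: the three-label tag set, the emission scores (standing in for the annotation probability matrix), and the transition scores are all hypothetical toy values chosen only to show how label-to-label dependencies change the decoded sequence.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label index path for one sentence.

    emissions[t][j]   : score of label j at token t
    transitions[i][j] : score of moving from label i to label j
    """
    if not emissions:
        return []
    num_labels = len(emissions[0])
    score = list(emissions[0])      # best score of any path ending in each label
    backpointers = []               # backpointers[t-1][j] = best previous label
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_labels):
            best_prev = max(range(num_labels),
                            key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_prev] + transitions[best_prev][j]
                             + emissions[t][j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final label to the start.
    path = [max(range(num_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical tag set and toy scores (assumptions, not from the paper).
LABELS = ["O", "B-ENT", "I-ENT"]
emissions = [
    [0.1, 2.0, 0.3],
    [1.5, 0.2, 0.1],
    [1.0, 0.4, 1.2],
]
transitions = [
    [0.5, 0.2, -5.0],   # O -> I-ENT is strongly penalized
    [0.3, -1.0, 1.0],
    [0.3, -1.0, 0.5],
]

greedy = [max(range(3), key=lambda j: row[j]) for row in emissions]
decoded = viterbi_decode(emissions, transitions)
print([LABELS[i] for i in greedy])
print([LABELS[i] for i in decoded])
```

In this toy case, taking the per-token maximum ends with an I-ENT tag directly after O, whereas the Viterbi path avoids that penalized transition. This is the kind of constraint a CRF layer contributes on top of per-token scores.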
Finally, an entity alignment method based on semantic similarity and Birch clustering is constructed; it improves the quality of the extracted APT attack knowledge by matching and merging knowledge into the infobox of each APT organization. Results: In terms of entity recognition, the proposed APT attack entity recognition method outperforms the existing entity recognition methods compared (CRF, LSTM-CRF, GRU-CRF, BiLSTM-CRF, CNN-CRF, and Bert-CRF). Its precision, recall, and F1-score are 0.9296, 0.8733, and 0.9006, respectively. Compared with CRF, the F1-score of the proposed model is increased by 14.32%. Compared with CNN-CRF, which integrates convolutional neural networks, the F1-score is increased by 6.92%. Compared with LSTM-CRF and BiLSTM-CRF, the F1-score is increased by 8.43% and 5.30%, respectively. Compared with GRU-CRF, the F1-score is increased by 8.74%. Compared with Bert-CRF, the F1-score is increased by 7.03%. In addition, the accuracy of the proposed model is 0.9004, which is 9.85% higher than the average of the other six models. The proposed model's training process is also more stable: its accuracy curve converges faster and reaches higher accuracy with fewer training batches, and its error curve converges faster and is smoother during training. Moreover, the proposed model achieves its best prediction performance on the "attack method" entity category, with an F1-score of 0.9275. On the one hand, this category contains a large number of entities. On the other hand, entities of this category occur widely in semantically rich descriptions of APT attack events and carry the action characteristics of attack behavior, which makes them easier to recognize.
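As a quick consistency check on the reported figures, the F1-score can be recomputed as the harmonic mean of precision and recall. The sketch below assumes the standard definition F1 = 2PR / (P + R); the function name is illustrative, and the inputs are the precision and recall values reported in this abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
full_corpus = f1_score(0.9296, 0.8733)
small_sample = f1_score(0.7800, 0.5894)

print(round(full_corpus, 4))   # 0.9006
print(round(small_sample, 4))  # 0.6714
```

Both recomputed values agree with the reported F1-scores to four decimal places.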
In terms of entity recognition with small-sample annotation, the proposed method's precision, recall, and F1-score are 0.7800, 0.5894, and 0.6714, respectively. Compared with the CRF, LSTM-CRF, GRU-CRF, BiLSTM-CRF, CNN-CRF, and Bert-CRF models, the F1-score of the proposed model is improved by 27.42%, 18.78%, 23.62%, 13.25%, 14.88%, and 14.46%, respectively. This experiment demonstrates that the proposed method can perform pre-training on a small-sample corpus through the Bert model, thereby improving entity recognition. In terms of entity alignment and knowledge fusion, the experiment automatically extracts the high-frequency named entities of each entity category, which often appear in APT attack events. For example, common APT organizations include "APT29", "APT32", "APT28", and "Turla"; common attack equipment includes "PowerShell", "Cobalt Strike", and "Mimikatz"; common attack methods include "Spearphishing", "C2", "Watering Hole Attack", and "Backdoor"; and common vulnerabilities include "CVE-2017-11882", "CVE-2017-0199", and "CVE-2012-0158". The proposed method combines corpus titles and keywords to fuse the names of APT organizations. Finally, the infobox of the common APT organizations in this dataset is constructed, forming structured knowledge for each APT organization, and the attack domain knowledge of APT28 and APT32 is shown in detail. Conclusions: According to the characteristics of APT attacks, an automatic extraction method for APT attack knowledge based on entity recognition and entity alignment is designed and implemented. This method can effectively identify APT attack entities, automatically extract advanced persistent threat knowledge under few-sample annotation, and generate structured feature portraits of common APT organizations, which will support subsequent APT attack knowledge graph construction and attack traceability analysis.
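The idea of aligning organization names and merging extracted entities into one infobox per organization can be sketched as follows. This is a simplified stand-in and not the paper's method: it substitutes character-level similarity from Python's `difflib` for semantic similarity and a greedy threshold grouping for Birch clustering. The spelling variants, the placeholder entity values, the 0.85 threshold, and all function names are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for semantic similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align_names(names, threshold=0.85):
    """Map every name to the first-seen name it is sufficiently similar to."""
    canonical = []
    mapping = {}
    for name in names:
        match = next((c for c in canonical
                      if similarity(name, c) >= threshold), None)
        if match is None:          # no close match: start a new group
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping


def build_infobox(records, threshold=0.85):
    """Merge (organization, category, entity) records under aligned names."""
    mapping = align_names([org for org, _, _ in records], threshold)
    infobox = defaultdict(lambda: defaultdict(set))
    for org, category, entity in records:
        infobox[mapping[org]][category].add(entity)
    return infobox


# Toy records with placeholder values (not taken from the paper's dataset).
records = [
    ("APT28", "attack equipment", "example-tool-1"),
    ("apt-28", "vulnerability", "example-cve-1"),
    ("APT 28", "attack method", "example-method-1"),
    ("APT32", "attack equipment", "example-tool-2"),
]

infobox = build_infobox(records)
for org, categories in infobox.items():
    print(org, {cat: sorted(vals) for cat, vals in categories.items()})
```

Character-level similarity is fragile for names that differ by a single digit (for example "APT28" and "APT32" already score 0.8 here), which is one reason a semantic similarity measure combined with titles and keywords, as described above, is a more robust choice in practice.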