An Effective Cost-Sensitive XGBoost Method for Malicious URLs Detection in Imbalanced Dataset

Shen He; Bangling Li; Huaxi Peng; Jun Xin; Erpeng Zhang

doi:10.1109/ACCESS.2021.3093094

IEEE Access (Jan 2021)

Affiliations

Shen He: Department of Security Technology, China Mobile Research Institute, Beijing, China
Bangling Li: Department of Security Technology, China Mobile Research Institute, Beijing, China
Huaxi Peng: Department of Security Technology, China Mobile Research Institute, Beijing, China
Jun Xin: Department of Security Technology, China Mobile Research Institute, Beijing, China
Erpeng Zhang: Department of Security Technology, China Mobile Research Institute, Beijing, China

DOI: https://doi.org/10.1109/ACCESS.2021.3093094
Journal volume & issue: Vol. 9
pp. 93089 – 93096

Abstract

Class imbalance is a common problem in modeling and has attracted increasing attention from researchers. If the imbalance ratio between the numbers of positive and negative labels is ignored, the resulting classifiers are biased and perform poorly on minority classes. The synthetic minority over-sampling technique (SMOTE) is a classic and popular over-sampling method that is widely used to address this problem. However, SMOTE increases label noise and training time during over-sampling. To improve the detection rate of minority classes while preserving efficiency, we propose a cost-sensitive XGBoost (CS-XGB) method for the imbalanced data problem. CS-XGB reduces the classifier's preference for the majority class without changing the distribution of the original data. To validate the method, 600,000 Uniform Resource Locators (URLs) were collected. We compare XGBoost (XGB), SMOTE+XGB, and CS-XGB, and the experimental results confirm that CS-XGB is robust and efficient in imbalanced cases.
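The abstract does not give the CS-XGB loss itself, so the following is only a minimal sketch of one common way to make a gradient-boosted classifier cost-sensitive: weighting the positive (minority) term of the logistic loss and returning the gradient and Hessian that a boosting library needs. The function name `weighted_logloss_grad_hess` and the `pos_cost` parameter are hypothetical and are not taken from the paper.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def weighted_logloss_grad_hess(margins, labels, pos_cost=5.0):
    """Gradient and Hessian of a class-weighted logistic loss.

    Assumed loss (illustrative, not the paper's formulation):
        loss_i = -[pos_cost * y_i * log(p_i) + (1 - y_i) * log(1 - p_i)]
    with p_i = sigmoid(margin_i). Derivatives are taken with respect
    to the raw margin, which is what a custom boosting objective returns.
    """
    grads, hess = [], []
    for z, y in zip(margins, labels):
        p = _sigmoid(z)
        # Misclassifying a positive sample costs pos_cost times more.
        w = pos_cost if y == 1 else 1.0
        grads.append(w * (p - y))
        hess.append(w * p * (1.0 - p))
    return grads, hess


if __name__ == "__main__":
    # One positive and one negative sample, both at margin 0 (p = 0.5).
    g, h = weighted_logloss_grad_hess([0.0, 0.0], [1, 0], pos_cost=5.0)
    print(g, h)
```

With `pos_cost` greater than 1, errors on the minority class pull harder on each boosting step, while the training data itself is left unchanged, which is the property the abstract contrasts with SMOTE. In the XGBoost Python package, a function of this shape can be adapted into a custom objective passed to `xgb.train` through its `obj` argument, and the built-in `scale_pos_weight` parameter offers a similar reweighting; whether either matches the authors' CS-XGB is an assumption.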

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
