IEEE Access (Jan 2021)

An Effective Cost-Sensitive XGBoost Method for Malicious URLs Detection in Imbalanced Dataset

  • Shen He,
  • Bangling Li,
  • Huaxi Peng,
  • Jun Xin,
  • Erpeng Zhang

DOI
https://doi.org/10.1109/ACCESS.2021.3093094
Journal volume & issue
Vol. 9
pp. 93089 – 93096

Abstract

Class imbalance is a common problem in model building and has attracted growing attention from researchers. If the imbalance ratio between positive and negative labels is ignored, the resulting classifiers are biased and perform poorly on minority classes. The synthetic minority over-sampling technique (SMOTE) is a classic and popular over-sampling method that is widely used to address this problem. However, SMOTE increases label noise and training time during the over-sampling process. To improve the detection rate of minority classes while preserving efficiency, we propose a cost-sensitive XGBoost (CS-XGB) method for the imbalanced data problem. CS-XGB reduces the classifier's preference for the majority class without changing the distribution of the original data. We collected 600,000 Uniform Resource Locators (URLs) to validate the method. We compare XGBoost (XGB), SMOTE+XGB, and CS-XGB, and the experimental results confirm that CS-XGB is robust and efficient in imbalanced cases.
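
The abstract does not state the exact cost formulation used in CS-XGB, so the following is only a minimal sketch of one common way to make a boosted logistic loss cost-sensitive: errors on the positive class are multiplied by a weight, and the per-sample gradient and Hessian that a boosting round would consume are scaled accordingly. The function names, the choice of label 1 as the minority class, and the negative-to-positive count ratio as the weight are all assumptions made for illustration, not details taken from the paper.

```python
import math


def sigmoid(z):
    """Map a raw margin score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def imbalance_weight(labels):
    """Heuristic cost for the positive class: n_negative / n_positive.

    Assumption: label 1 marks the minority class.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive samples; the weight is undefined")
    return n_neg / n_pos


def weighted_logloss_grad_hess(margin, label, pos_weight):
    """First and second derivative of the weighted logistic loss
    with respect to the raw margin, for a single sample.

    Loss: -[w * y * log(p) + (1 - y) * log(1 - p)], with p = sigmoid(margin).
    The training data is untouched; only the loss terms of positive
    samples are scaled by pos_weight.
    """
    p = sigmoid(margin)
    w = pos_weight if label == 1 else 1.0
    grad = w * (p - label)
    hess = w * p * (1.0 - p)
    return grad, hess


if __name__ == "__main__":
    toy_labels = [0, 0, 0, 0, 1]  # hypothetical 4:1 imbalance
    w = imbalance_weight(toy_labels)
    print("positive-class weight:", w)
    print("positive sample:", weighted_logloss_grad_hess(0.0, 1, w))
    print("negative sample:", weighted_logloss_grad_hess(0.0, 0, w))
```

With the toy labels above, a misclassified positive sample contributes four times the gradient magnitude of an unweighted one, which pushes the model toward the minority class without generating synthetic samples. The XGBoost library offers a similar effect through its `scale_pos_weight` parameter; whether CS-XGB uses that parameter or a custom objective is not stated in the abstract.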

Keywords