Complex & Intelligent Systems (Jun 2024)

An oversampling algorithm of multi-label data based on cluster-specific samples and fuzzy rough set theory

  • Jinming Liu,
  • Kai Huang,
  • Chen Chen,
  • Jian Mao

DOI
https://doi.org/10.1007/s40747-024-01498-w
Journal volume & issue
Vol. 10, no. 5
pp. 6267 – 6282

Abstract

Imbalanced class distributions are common in real-world scenarios, including in multi-label datasets. Oversampling is a widely used remedy: it balances the class distribution and can improve the effectiveness of classification models. Generating synthetic data for multi-label datasets is harder, however, because each sample carries a set of labels, so every synthetic sample must be carefully placed and assigned a label set. We propose MLCSMOTE-FRST, an algorithm for synthetic data generation based on label-specific clustering and fuzzy rough set theory. Clusters specific to each label provide the generation ratios and the dependency samples, taking into account both the overall label distribution and the distribution within each cluster. The labels of each synthetic sample are supported by intra-cluster positive samples, which are determined using fuzzy rough set theory and help to capture the consensus label set. Experimental results on multi-label datasets with four classifiers demonstrate the effectiveness of the proposed method in terms of macro-F1 and micro-F1 scores.
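The abstract reports results as macro-F1 and micro-F1. The sketch below illustrates, under stated assumptions, why both are usually reported for imbalanced multi-label data: macro-F1 averages the per-label F1 scores, so a rare label counts as much as a frequent one, while micro-F1 pools the counts over all labels, so frequent labels dominate. It is not taken from the paper; the function names and the toy data are hypothetical, and labels are assumed to be given as binary indicator rows.

```python
# Illustrative sketch only: function names and toy data are assumptions,
# not part of MLCSMOTE-FRST or the paper's evaluation code.

def label_counts(y_true, y_pred):
    """Return one (tp, fp, fn) tuple per label for binary indicator rows."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    return counts


def f1_from_counts(tp, fp, fn):
    """F1 = 2*tp / (2*tp + fp + fn); defined here as 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    counts = label_counts(y_true, y_pred)
    return sum(f1_from_counts(*c) for c in counts) / len(counts)


def micro_f1(y_true, y_pred):
    """F1 computed from counts pooled over all labels."""
    counts = label_counts(y_true, y_pred)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1_from_counts(tp, fp, fn)


# Toy data: label 0 is frequent, label 1 is rare and never predicted.
y_true = [[1, 0], [1, 0], [1, 0], [1, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 0]]

print(macro_f1(y_true, y_pred))  # per-label F1 is 1.0 and 0.0, mean 0.5
print(micro_f1(y_true, y_pred))  # pooled tp=4, fp=0, fn=1, so 8/9
```

In this toy case the classifier ignores the rare label entirely, which halves macro-F1 but barely lowers micro-F1. An oversampling method aimed at minority labels would therefore be expected to show its effect mainly in macro-F1.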

Keywords