IEEE Access (Jan 2025)
Addressing Data Imbalance in Crash Data: Evaluating Generative Adversarial Network’s Efficacy Against Conventional Methods
Abstract
In traffic safety analysis, the inherent imbalance in crash datasets, particularly with respect to injury severity, poses a significant challenge for machine learning-based classification models. This study examines the efficacy of Generative Adversarial Networks (GANs), with a specific focus on Conditional Tabular GAN (CTGAN), for synthesizing minority-class crash data to address this imbalance. Using traffic crash data from Chicago spanning 2020 to 2022, the study compares CTGAN against three traditional data resampling methods and a cost-sensitive learning approach. These methods are evaluated across three injury severity classification scenarios (2-class, 3-class, and 4-class) using five commonly applied injury severity classification models. The evaluation is twofold, covering both the quality of the synthetic data and the improvement in classification model performance. The key findings are: 1) CTGAN generates markedly higher-quality synthetic data than the other data resampling techniques, particularly for the least represented injury severity category; 2) CTGAN substantially improves classification model performance relative to traditional data resampling methods, but this advantage diminishes as the number of injury categories increases; 3) Surprisingly, CTGAN’s superior data quality does not translate into better classification performance than cost-sensitive learning, especially in the more complex classification scenarios. Cost-sensitive learning combined with LightGBM achieves the best classification performance across all scenarios. Because cost-sensitive learning also requires significantly fewer computational resources, it is recommended for handling imbalanced injury severity data.
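To make the cost-sensitive alternative concrete, the following is a minimal sketch of inverse-frequency class weighting, one common way to penalize errors on rare classes more heavily without generating any synthetic records. The abstract does not state which weighting scheme the study uses, so the formula here (the widely used "balanced" heuristic, n_samples / (n_classes * class_count)) is an assumption, and the function name, class labels, and counts are hypothetical toy values rather than figures from the Chicago data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumption: this is the common "balanced" heuristic, not necessarily
    the weighting scheme used in the study.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical, heavily imbalanced 3-class severity labels (toy data).
labels = ["no_injury"] * 90 + ["injury"] * 9 + ["fatal"] * 1

weights = balanced_class_weights(labels)

# Per-sample weights that a classifier accepting sample weights could consume.
sample_weights = [weights[y] for y in labels]

for cls, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{cls}: {w:.3f}")
```

Under this heuristic each class contributes the same total weight to the training loss, so the rarest severity class is no longer swamped by the majority class. Per-sample weights of this kind can typically be passed to a gradient-boosting classifier at fit time, which is considerably cheaper than training a generative model first.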
Keywords