Applied Sciences (Nov 2024)

An Interpretable Machine Learning-Based Hurdle Model for Zero-Inflated Road Crash Frequency Data Analysis: Real-World Assessment and Validation

  • Moataz Bellah Ben Khedher,
  • Dukgeun Yun

DOI
https://doi.org/10.3390/app142310790
Journal volume & issue
Vol. 14, no. 23
p. 10790

Abstract

Read online

Road traffic crashes pose significant economic and public health burdens, necessitating an in-depth understanding of crash causation and its links to underlying factors. This study introduces a machine learning-based hurdle model framework tailored for analyzing zero-inflated crash frequency data, addressing the limitations of traditional statistical models like the Poisson and negative binomial models, which struggle with zero-inflation and overdispersion. The research employs a two-stage modeling process using CatBoost. The first stage uses binary classification to identify road segments with potential crash occurrences, applying a customized loss function to tackle data imbalance. The second stage predicts crash frequency, also utilizing a customized loss function for count data. SHapley Additive exPlanations (SHAP) analysis interprets the model outcomes, providing insights into factors affecting crash likelihood and frequency. This study validates the model’s performance with real-world crash data from 2011 to 2015 in South Korea, demonstrating superior accuracy in both the classification and regression stages compared to other machine learning algorithms and traditional models. These findings have significant implications for traffic safety research and policymaking, offering stakeholders a more accurate and interpretable tool for crash data analysis to develop targeted safety interventions.

Keywords