Transportation Research Interdisciplinary Perspectives (May 2023)

A study on road accident prediction and contributing factors using explainable machine learning models: analysis and performance

  • Shakil Ahmed,
  • Md Akbar Hossain,
  • Sayan Kumar Ray,
  • Md Mafijul Islam Bhuiyan,
  • Saifur Rahman Sabuj

Journal volume & issue
Vol. 19
p. 100814

Abstract

Road accidents are increasing worldwide and cause millions of deaths each year. They impose significant financial and economic costs on society. Existing research has mostly treated road accident prediction as a classification problem that aims to predict whether a traffic accident will occur, without exploring the underlying relationships between the complex factors that contribute to road accidents. A number of studies have explored the importance of contributing factors in relation to road accidents and their severity; however, only a few of them have explored a subset of ensemble machine learning (ML) models and the New Zealand (NZ) road accident dataset. Therefore, in this paper, we evaluate a set of ML models to predict road accident severity based on the most recent NZ road accident dataset. We also analyse the predicted results and apply an explainable ML (XML) technique to evaluate the importance of road accident contributing factors. To predict road accidents with different injury severity, this work considers several ensemble ML models: Random Forest (RF), Decision Jungle (DJ), Adaptive Boosting (AdaBoost), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (L-GBM), and Categorical Boosting (CatBoost). The study uses NZ road accident data from 2016 through 2020 obtained from the New Zealand Ministry of Transport. The comparison shows that RF is the best classifier, with 81.45% accuracy, 81.68% precision, 81.42% recall, and an F1-score of 81.04%. Next, we employ Shapley value analysis as an XML technique to interpret the RF model at global and local levels. While the global-level explanation ranks the features' contributions to severity classification, the local-level explanation explores how features are used in the model. Furthermore, the SHapley Additive exPlanations (SHAP) dependence plot is used to investigate the relationships and interactions of the features with respect to the target variable prediction. The findings indicate that the road category and the number of vehicles involved in an accident significantly impact injury severity. The high-ranked features identified through SHAP analysis are used to retrain the ML models and measure their performance. The results show performance increases of 6%, 5%, and 8% for the DJ, AdaBoost, and CatBoost models, respectively.
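
The distinction between local and global Shapley explanations mentioned in the abstract can be illustrated with a minimal sketch. The scoring function, the feature values, the all-zero baseline, and the third feature (`wet`) below are hypothetical and are not taken from the paper. The sketch computes exact Shapley values by enumerating feature coalitions, which is only feasible for a handful of features; it is not the algorithm the SHAP library applies to tree ensembles such as RF.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample x.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # k = size of the coalition without feature i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy severity score (an assumption, not the paper's RF model).
def toy_severity_score(z):
    road_category, vehicles, wet = z
    return 0.5 * road_category + 0.3 * vehicles + 0.1 * vehicles * wet


# Hypothetical samples: [road_category, vehicles, wet]
samples = [[1, 2, 0], [0, 3, 1], [1, 1, 1]]
baseline = [0, 0, 0]

# Local explanation: one vector of contributions per sample.
local = [shapley_values(toy_severity_score, x, baseline) for x in samples]

# Global explanation: mean absolute contribution per feature across samples.
global_importance = [
    sum(abs(row[i]) for row in local) / len(local) for i in range(3)
]
```

Each row of `local` sums to the difference between the score for that sample and the score for the baseline, which is the additivity property that gives SHAP its name. Sorting `global_importance` gives a feature ranking of the kind the abstract describes using to select features for retraining.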

Keywords