Using Multimodal Large Language Models (MLLMs) for Automated Detection of Traffic Safety-Critical Events

Mohammad Abu Tami; Huthaifa I. Ashqar; Mohammed Elhenawy; Sebastien Glaser; Andry Rakotonirainy

doi:10.3390/vehicles6030074

Vehicles (Sep 2024)

Affiliations

Mohammad Abu Tami: Natural, Engineering and Technology Sciences Department, Arab American University, Jenin P.O Box 240, Palestine
Huthaifa I. Ashqar: Civil Engineering Department, Arab American University, Jenin P.O Box 240, Palestine
Mohammed Elhenawy: CARRS-Q, Queensland University of Technology, Kelvin Grove, QLD 4059, Australia
Sebastien Glaser: CARRS-Q, Queensland University of Technology, Kelvin Grove, QLD 4059, Australia
Andry Rakotonirainy: CARRS-Q, Queensland University of Technology, Kelvin Grove, QLD 4059, Australia

Journal volume & issue: Vol. 6, no. 3, pp. 1571–1590

Abstract

Traditional approaches to safety event analysis in autonomous systems have relied on complex machine learning and deep learning models and on extensive datasets to achieve high accuracy and reliability. However, the emergence of multimodal large language models (MLLMs) offers a novel approach by integrating textual, visual, and audio modalities. Our framework leverages the logical and visual reasoning power of MLLMs, directing their output through object-level question–answer (QA) prompts to produce accurate, reliable, and actionable insights for safety-critical event detection and analysis. By incorporating models like Gemini-Pro-Vision 1.5, we aim to automate safety-critical event detection and analysis while mitigating common issues such as hallucinations in MLLM outputs. The results demonstrate the framework’s potential in different in-context learning (ICL) settings, such as zero-shot and few-shot learning. Furthermore, we investigate other settings, such as self-ensemble learning and a varying number of frames. The results show that the few-shot learning model consistently outperformed the other learning models, achieving the highest overall accuracy of about 79%. A comparative analysis with previous studies on visual reasoning revealed that previous models showed moderate performance in driving safety tasks, while our proposed model significantly outperformed them. To the best of our knowledge, our proposed MLLM model is the first of its kind capable of handling multiple tasks for each safety-critical event. It can identify risky scenarios, classify diverse scenes, determine car directions, categorize agents, and recommend appropriate actions, setting a new standard in safety-critical event management. This study shows the significance of MLLMs in advancing the analysis of naturalistic driving videos to improve safety-critical event detection and the understanding of interactions in complex environments.
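
The prompting settings named in the abstract (zero-shot, few-shot, and self-ensemble) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the question wording, the `QUESTIONS` table, and the helpers `build_prompt` and `self_ensemble` are hypothetical names introduced only for illustration, the frames are stood in for by text descriptions, and no model is called. Self-ensembling is assumed here to mean a majority vote over repeated answers to the same prompt, which the abstract does not confirm.

```python
from collections import Counter

# Hypothetical object-level questions, one per task listed in the abstract.
# The paper's actual prompts are not given in this record.
QUESTIONS = {
    "risk": "Is there a safety-critical risk in these frames? Answer yes or no.",
    "scene": "What type of scene is shown?",
    "direction": "Which direction is the ego car heading?",
    "agent": "Which type of agent poses the risk?",
    "action": "What action should the ego car take?",
}


def build_prompt(question, examples=()):
    """Assemble a text prompt: zero-shot if examples is empty, few-shot otherwise.

    Each example is a (frame_description, answer) pair; in a real pipeline the
    frames themselves would be attached as images (assumption).
    """
    lines = []
    for description, answer in examples:
        lines.append(f"Example frames: {description}")
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)


def self_ensemble(answers):
    """Return the majority answer among repeated model outputs (assumed scheme)."""
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]


if __name__ == "__main__":
    shots = [("pedestrian stepping off the curb ahead", "yes")]
    print(build_prompt(QUESTIONS["risk"], shots))
    # Pretend the same prompt was sent three times and returned these strings.
    print(self_ensemble(["Yes", "no", "yes "]))
```

In such a setup, each of the five questions would be asked separately for every event, which keeps answers short and makes them easier to check against labels than a single free-form description.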

Published in Vehicles

ISSN: 2624-8921 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Mechanical engineering and machinery: Machine design and drawing; Technology: Motor vehicles. Aeronautics. Astronautics
Website: https://www.mdpi.com/journal/vehicles
