GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models

Haicheng Liao; Huanming Shen; Zhenning Li; Chengyue Wang; Guofa Li; Yiming Bie; Chengzhong Xu

doi:10.1016/j.commtr.2023.100116

Communications in Transportation Research (Dec 2024)

Affiliations

Haicheng Liao: State Key Laboratory of Internet of Things for Smart City and Department of Computer and Information Science, University of Macau, Macau SAR, 999078, China
Huanming Shen: Department of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, 610000, China
Zhenning Li: State Key Laboratory of Internet of Things for Smart City and Departments of Civil and Environmental Engineering and Computer and Information Science, University of Macau, Macau SAR, 999078, China; Corresponding author.
Chengyue Wang: State Key Laboratory of Internet of Things for Smart City and Department of Civil and Environmental Engineering, University of Macau, Macau SAR, 999078, China
Guofa Li: College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing, 400030, China
Yiming Bie: School of Transportation, Jilin University, Changchun, 130000, China
Chengzhong Xu: State Key Laboratory of Internet of Things for Smart City and Department of Computer and Information Science, University of Macau, Macau SAR, 999078, China; Corresponding author.

DOI: https://doi.org/10.1016/j.commtr.2023.100116
Journal volume & issue: Vol. 4, p. 100116

Abstract

In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This paper introduces a sophisticated encoder-decoder framework, developed to address visual grounding in AVs. Our Context-Aware Visual Grounding (CAVG) model is an advanced system that integrates five core encoders—Text, Emotion, Image, Context, and Cross-Modal—with a multimodal decoder. This integration enables the CAVG model to adeptly capture contextual semantics and to learn human emotional features, augmented by state-of-the-art Large Language Models (LLMs) including GPT-4. The architecture of CAVG is reinforced by the implementation of multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation. This architectural design enables the model to efficiently process and interpret a range of cross-modal inputs, yielding a comprehensive understanding of the correlation between verbal commands and corresponding visual scenes. Empirical evaluations on the Talk2Car dataset, a real-world benchmark, demonstrate that CAVG establishes new standards in prediction accuracy and operational efficiency. Notably, the model exhibits exceptional performance even with limited training data, ranging from 50% to 75% of the full dataset. This feature highlights its effectiveness and potential for deployment in practical AV applications. Moreover, CAVG has shown remarkable robustness and adaptability in challenging scenarios, including long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather conditions, and densely populated urban environments.

Published in Communications in Transportation Research

ISSN: 2772-4247 (Online)
Publisher: Elsevier
Country of publisher: United Kingdom
LCC subjects: Technology: Engineering (General). Civil engineering (General): Transportation engineering
Website: https://www.journals.elsevier.com/communications-in-transportation-research
