IEEE Access (Jan 2024)

DivNEDS: Diverse Naturalistic Edge Driving Scene Dataset for Autonomous Vehicle Scene Understanding

  • John Owusu Duah
  • Armstrong Aboah
  • Stephen Osafo-Gyamfi

DOI
https://doi.org/10.1109/ACCESS.2024.3394530
Journal volume & issue
Vol. 12
pp. 60628–60640

Abstract

The safe implementation and adoption of Autonomous Vehicle (AV) vision models on public roads require not only an understanding of the natural environment comprising pedestrians and other vehicles but also the ability to reason about edge situations such as unpredictable maneuvers by other drivers, impending accidents, erratic movement of pedestrians, cyclists, and motorcyclists, animal crossings, and cyclists using hand signals. Despite advances in complex tasks such as object tracking, human behavior modeling, activity recognition, and trajectory planning, the fundamental challenge of interpretable scene understanding, especially in out-of-distribution environments, persists: 84% of AV disengagements in real-world tests are attributed to scene-understanding errors. To address this limitation, we introduce the Diverse Naturalistic Edge Driving Scene Dataset (DivNEDS), a novel dataset comprising 11,084 edge scenes and 203,000 descriptive captions sourced from 12 distinct locations worldwide, captured under varying weather conditions and at different times of the day. Our approach includes a novel embedded hierarchical dense captioning strategy aimed at enabling few-shot learning and mitigating overfitting by excluding irrelevant scene elements. Additionally, we propose a Generative Region-to-Text Transformer that achieves a baseline embedded hierarchical dense-captioning performance of 60.3 mAP, setting a new benchmark for AV scene-understanding models trained on dense-captioned datasets. This work represents a significant step toward improving AVs' ability to comprehend diverse, real-world edge and complex driving scenarios, thereby enhancing their safety and adaptability in dynamic environments. The dataset and instructions are available at https://github.com/johnowusuduah/DivNEDS.
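To illustrate the idea of embedded hierarchical dense captioning, the sketch below traverses a nested annotation in which captioned regions are embedded inside larger captioned regions (e.g., a hand-signal region inside a cyclist region inside the full scene). The field names (bbox, caption, children) and the example scene are illustrative assumptions, not the dataset's documented schema; consult the GitHub repository for the actual annotation format.

    # Minimal sketch of traversing a DivNEDS-style embedded hierarchical
    # dense-captioning annotation. Field names are assumed for illustration.
    from typing import Iterator

    def iter_captions(region: dict, depth: int = 0) -> Iterator[tuple]:
        """Yield (depth, bbox, caption) for a region and every region
        embedded inside it, outermost first."""
        yield depth, region["bbox"], region["caption"]
        for child in region.get("children", []):
            yield from iter_captions(child, depth + 1)

    if __name__ == "__main__":
        # Hypothetical example: a cyclist region embedded in the scene,
        # with a hand-signal region embedded in the cyclist region.
        scene = {
            "bbox": [0, 0, 1920, 1080],
            "caption": "wet urban road at dusk with oncoming traffic",
            "children": [
                {
                    "bbox": [820, 430, 1010, 760],
                    "caption": "cyclist merging into the ego lane",
                    "children": [
                        {
                            "bbox": [960, 455, 1005, 520],
                            "caption": "left arm extended, signaling a turn",
                            "children": [],
                        }
                    ],
                }
            ],
        }
        for depth, bbox, caption in iter_captions(scene):
            print("  " * depth + f"{bbox}: {caption}")

Traversing outermost-first mirrors the hierarchy described in the abstract: context captions scope the scene, while embedded captions attach fine-grained descriptions only to the regions that matter, omitting irrelevant scene elements.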
