IEEE Access (Jan 2024)
DivNEDS: Diverse Naturalistic Edge Driving Scene Dataset for Autonomous Vehicle Scene Understanding
Abstract
The safe implementation and adoption of Autonomous Vehicle (AV) vision models on public roads require not only an understanding of the natural environment comprising pedestrians and other vehicles but also the ability to reason about edge situations such as unpredictable maneuvers by other drivers; impending accidents; erratic movements by pedestrians, cyclists, and motorcyclists; animal crossings; and cyclists using hand signals. Despite advances in complex tasks such as object tracking, human behavior modeling, activity recognition, and trajectory planning, the fundamental challenge of interpretable scene understanding, especially in out-of-distribution environments, remains unresolved. This is underscored by the fact that 84% of AV disengagements in real-world tests are attributed to scene understanding errors. To address this limitation, we introduce the Diverse Naturalistic Edge Driving Scene Dataset (DivNEDS), a novel dataset comprising 11,084 edge scenes and 203,000 descriptive captions sourced from 12 distinct locations worldwide, captured under varying weather conditions and at different times of the day. Our approach includes a novel embedded hierarchical dense captioning strategy aimed at enabling few-shot learning and mitigating overfitting by excluding irrelevant scene elements. Additionally, we propose a Generative Region-to-Text Transformer, which achieves a baseline embedded hierarchical dense captioning performance of 60.3 mAP, establishing a new benchmark for AV scene understanding models trained on dense-captioned datasets. This work represents a significant step toward improving AVs’ ability to comprehend diverse, real-world edge and complex driving scenarios, thereby enhancing their safety and adaptability in dynamic environments. The dataset and instructions are available at https://github.com/johnowusuduah/DivNEDS.
Keywords