Applied Sciences (Mar 2025)

Enhanced BLIP-2 Optimization Using LoRA for Generating Dashcam Captions

  • Minjun Cho,
  • Sungwoo Kim,
  • Dooho Choi,
  • Yunsick Sung

DOI
https://doi.org/10.3390/app15073712
Journal volume & issue
Vol. 15, no. 7
p. 3712

Abstract


Autonomous driving technology has advanced significantly; however, accurately generating captions for driving scenes remains challenging because such scenes involve dynamic elements such as vehicles, traffic signals, road conditions, weather, and the time of day. Capturing these elements accurately is important for improving situational awareness in autonomous systems. Driving-scene captioning is an important part of generating driving scenarios and enhancing the interpretability of autonomous systems. However, traditional vision–language models struggle with domain adaptation: autonomous driving datasets with detailed captions of dashcam-recorded scenes are scarce, and such models cannot effectively capture diverse driving environment factors. In this paper, we propose an enhanced method based on Bootstrapping Language-Image Pre-training with frozen image encoders and large language models (BLIP-2) that optimizes domain adaptation by improving scene captioning in autonomous driving environments. The method comprises two steps: (1) transforming structured dataset labels into descriptive natural-language captions using a large language model (LLM), and (2) fine-tuning the Q-Former in the BLIP-2 module with low-rank adaptation (LoRA) for efficient domain adaptation. The structured dataset labels are originally stored in JSON format, where driving-scene factors are encoded as key-value pairs representing attributes such as object type, position, and state. On the Large-Scale Diverse Driving Video Database (BDD-100K), our method significantly improves performance, achieving BLEU-4, CIDEr, and SPICE scores each approximately 1.5 times those of the baseline BLIP-2. These higher scores demonstrate the effectiveness of LoRA-based optimization and, hence, its suitability for autonomous driving applications. Our method effectively enhances accuracy, contextual relevance, and interpretability, contributing to improved scene understanding in autonomous driving systems.
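To make step (1) concrete, the sketch below flattens JSON key-value scene labels into a descriptive sentence. The field names and values are hypothetical (BDD-100K's actual schema differs), and the paper delegates this rewriting to an LLM rather than the fixed template used here; this only illustrates the label-to-caption idea.

```python
import json

# Hypothetical structured label: scene attributes encoded as key-value pairs,
# loosely modeled on BDD-100K-style annotations (not the real schema).
label_json = json.dumps({
    "weather": "rainy",
    "timeofday": "night",
    "scene": "city street",
    "objects": [
        {"category": "car", "state": "moving"},
        {"category": "traffic light", "state": "red"},
    ],
})

def labels_to_caption(raw: str) -> str:
    """Flatten key-value scene attributes into one descriptive caption."""
    label = json.loads(raw)
    objs = ", ".join(f"a {o['state']} {o['category']}" for o in label["objects"])
    return (f"A dashcam view of a {label['scene']} at {label['timeofday']} "
            f"in {label['weather']} weather, showing {objs}.")

print(labels_to_caption(label_json))
# → A dashcam view of a city street at night in rainy weather, showing a moving car, a red traffic light.
```

In the proposed method, a prompt built from such labels is given to an LLM, which produces richer and more varied captions than a template can.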
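Step (2) relies on LoRA's core trick: keep a pretrained weight matrix frozen and train only a low-rank update. The numeric sketch below shows the mechanism on a single linear layer; the dimensions, rank, and scaling are illustrative assumptions, not the paper's Q-Former settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 16       # rank r is much smaller than d_in

# Frozen pretrained weight (stand-in for e.g. a Q-Former projection matrix).
W = rng.standard_normal((d_out, d_in))

# Trainable low-rank factors: the effective weight is W + (alpha / r) * B @ A.
A = rng.standard_normal((r, d_in)) * 0.01   # down-projection
B = np.zeros((d_out, r))                    # up-projection, zero-init so the
                                            # adapter starts as a no-op

def lora_forward(x):
    return x @ (W + (alpha / r) * B @ A).T

x = rng.standard_normal((1, d_in))
# With B = 0 the adapted layer matches the frozen layer exactly.
assert np.allclose(lora_forward(x), x @ W.T)

# Only A and B are trained: a small fraction of W's parameter count.
print(A.size + B.size, "trainable vs", W.size, "frozen")
# → 512 trainable vs 4096 frozen
```

This is why LoRA enables efficient domain adaptation: gradients flow only through the small factors A and B, while the frozen BLIP-2 weights are left untouched.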

Keywords