Military Image Captioning for Low-Altitude UAV or UGV Perspectives

Lizhi Pan; Chengtian Song; Xiaozheng Gan; Keyu Xu; Yue Xie

doi:10.3390/drones8090421

Drones (Aug 2024)

Affiliations

Lizhi Pan: School of Mechatronical Engineering, Beijing Institute of Technology, Beijing 100081, China
Chengtian Song: School of Mechatronical Engineering, Beijing Institute of Technology, Beijing 100081, China
Xiaozheng Gan: School of Mechatronical Engineering, Beijing Institute of Technology, Beijing 100081, China
Keyu Xu: School of Mechatronical Engineering, Beijing Institute of Technology, Beijing 100081, China
Yue Xie: Science and Technology on Electromechanical Dynamic Control Laboratory, Xi’an 710065, China

DOI: https://doi.org/10.3390/drones8090421
Journal volume & issue: Vol. 8, no. 9, p. 421

Abstract

Low-altitude unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) offer high-resolution imaging and agile maneuvering, are widely used in military scenarios, and generate vast amounts of image data that can be turned into textual intelligence to support military decision making. Military image captioning (MilitIC), a visual-language learning task, provides innovative solutions for military image understanding and intelligence generation. However, the scarcity of military image datasets hinders the advancement of MilitIC methods, especially those based on deep learning. To overcome this limitation, we introduce an open-access benchmark dataset, the Military Objects in Real Combat (MOCO) dataset. It features real combat images captured from the perspective of low-altitude UAVs or UGVs, along with a comprehensive set of captions. Furthermore, we propose MAE-MilitIC, a novel encoder–augmentation–decoder image-captioning architecture with a map augmentation embedding (MAE) mechanism, which uses both image and text modalities as a guiding prefix for caption generation and bridges the semantic gap between visual and textual data. The MAE mechanism maps both image and text embeddings onto a semantic subspace constructed from relevant military prompts, and augments the military semantics of the image embeddings with attribute-explicit text embeddings. Extensive experiments on two challenging datasets show that MAE-MilitIC outperforms existing models, providing strong support for intelligence warfare based on military UAVs and UGVs.
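The abstract describes the MAE mechanism only at a high level, so the following is a minimal sketch of one way the "map, then augment" idea could be read, not the authors' implementation. It assumes the semantic subspace is the linear span of a few prompt embeddings, that mapping means orthogonal projection onto that span, and that augmentation is a weighted sum followed by normalization. The function names `project_onto_subspace` and `augment`, the mixing weight `alpha`, and the toy vectors are all hypothetical.

```python
import numpy as np


def project_onto_subspace(x, prompts):
    """Orthogonally project x onto the span of the prompt embeddings.

    prompts has shape (k, d): one row per prompt embedding.
    Assumption: the rows are linearly independent and k <= d.
    """
    # Reduced QR of prompts.T yields an orthonormal basis q of shape (d, k).
    q, _ = np.linalg.qr(prompts.T)
    return q @ (q.T @ x)


def augment(image_emb, text_emb, prompts, alpha=0.5):
    """Hypothetical augmentation: mix the projected image and text embeddings.

    alpha is an assumed mixing weight, not a value from the paper.
    """
    img_p = project_onto_subspace(image_emb, prompts)
    txt_p = project_onto_subspace(text_emb, prompts)
    mixed = img_p + alpha * txt_p
    # Return a unit-length prefix vector (assumes mixed is non-zero).
    return mixed / np.linalg.norm(mixed)


# Toy 4-dimensional example: two prompts spanning the first two axes.
prompts = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]])
image_emb = np.array([1.0, 2.0, 3.0, 4.0])
text_emb = np.array([0.0, 2.0, 0.0, 1.0])

prefix = augment(image_emb, text_emb, prompts)
```

In this reading, components of the embeddings that lie outside the prompt span are discarded, so the resulting prefix carries only the directions that the prompts define. The actual model in the paper may use learned mappings rather than a fixed projection.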

Published in Drones

ISSN: 2504-446X (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Motor vehicles. Aeronautics. Astronautics
Website: http://www.mdpi.com/journal/drones
