IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)

Faster Transformer-DS: Multiscale Vehicle Detection of Remote-Sensing Images Based on Transformer and Distance-Scale Loss

  • Jiahuan Zhang,
  • Hengzhen Liu,
  • Yi Zhang,
  • Menghan Li,
  • Zongqian Zhan

DOI
https://doi.org/10.1109/JSTARS.2023.3335283
Journal volume & issue
Vol. 17
pp. 1961 – 1975

Abstract

Vehicle detection (VD) on remote sensing (RS) images has achieved impressive results in recent years, mainly thanks to the development of popular learning-based object detection architectures [e.g., faster region-based convolutional neural network (R-CNN), the YOLO series, etc.]. However, for RS images, multiscale VD of tiny vehicles remains challenging. In particular, vehicles as tiny objects typically contain only a few pixels with very little information for model training and validation, which can result in inaccurate localization and unreliable classification. In this article, we present a new detection model called faster transformer-DS, in which two improvements are proposed and discussed as follows. 1) Instead of the CNN-based ResNet50, a transformer-based backbone, pyramid vision transformer v2-b0, which can extract global context information, is investigated as the feature-extraction backbone for both object classification and bounding-box regression. 2) In contrast to conventional norm-based and intersection-over-union (IoU)-based loss functions, to further fine-tune the localization of the predicted bounding box, we propose a novel distance-scale loss function: the distance loss is directly related to the absolute values of the length and width of the bounding box together with its center position, while the scale loss refers to the geometric shape ratio between the bounding box and the ground-truth box. To demonstrate the efficacy of our model, comprehensive experiments show that both proposed methods boost the performance of multiscale VD on RS images and that the full model achieves the best results, as confirmed by quantitative metrics and visual detection results. In addition, our model is compared with several state-of-the-art methods and proves notably superior to many common detection models.
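The abstract describes the distance-scale (DS) loss only qualitatively, so the snippet below is a hedged, PyTorch-style sketch of that description rather than the authors' exact formulation: a distance term built from absolute differences of the box center and of its width and height, plus a scale term that penalizes the geometric shape ratio between predicted and ground-truth boxes via absolute log-ratios. The function name distance_scale_loss, the lambda_scale weight, and the (cx, cy, w, h) box convention are illustrative assumptions.

import torch

def distance_scale_loss(pred, target, lambda_scale=1.0, eps=1e-7):
    # pred, target: (N, 4) tensors in (cx, cy, w, h) format.
    # Illustrative reading of the DS loss described in the abstract,
    # not the paper's published definition.
    cx_p, cy_p, w_p, h_p = pred.unbind(dim=-1)
    cx_t, cy_t, w_t, h_t = target.unbind(dim=-1)

    # Distance term: absolute offsets of the center position and of the
    # box width/height between prediction and ground truth.
    dist = ((cx_p - cx_t).abs() + (cy_p - cy_t).abs()
            + (w_p - w_t).abs() + (h_p - h_t).abs())

    # Scale term: geometric shape ratio between the predicted and
    # ground-truth boxes, expressed as symmetric absolute log-ratios.
    scale = (torch.log((w_p + eps) / (w_t + eps)).abs()
             + torch.log((h_p + eps) / (h_t + eps)).abs())

    return (dist + lambda_scale * scale).mean()

In a two-stage detector such as Faster R-CNN, a term like this would replace or complement the usual smooth-L1/IoU box-regression loss in the detection head; the weighting between the distance and scale terms is a tunable assumption here.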

Keywords