International Journal of Applied Earth Observations and Geoinformation (Apr 2024)

TabCtNet: Target-aware bilateral CNN-transformer network for single object tracking in satellite videos

  • Qiqi Zhu,
  • Xin Huang,
  • Qingfeng Guan

Journal volume & issue
Vol. 128
p. 103723

Abstract

Satellite video object tracking has become an emerging technology for dynamically observing the Earth, making it possible to track moving objects within a short time. Deep learning methods such as CNN-based and transformer-based trackers have been widely applied to single object tracking in natural videos. However, targets in natural videos are captured by ground sensors, whereas satellite sensors observe from altitudes of hundreds of kilometers or more. Trackers designed for natural videos may therefore be affected by complex backgrounds, especially for small targets with weak features as seen from remote sensing platforms. Furthermore, confusion between the target and visually similar objects, as well as target deformation in satellite videos, can lead to incorrect localization. To address these problems, we propose a target-aware bilateral CNN-Transformer network (TabCtNet). In TabCtNet, a bilateral CNN-Transformer architecture with aggregation of and interaction between local spatial information and global temporal context is designed to tackle the challenge of small targets with weak features against complex, cluttered backgrounds in satellite videos. To reduce the impact of similar objects, a target-aware block-erasing strategy (TAS) is constructed to generate weakened heatmaps from the template target mask in a data-driven manner. Moreover, a pixel-wise refinement module with corner-based box estimation (PE) is designed to extract finer-grained spatial information for more accurate box estimation and to reduce the effect of target deformation. Experimental results show that TabCtNet quantitatively and qualitatively outperforms advanced single object tracking methods on two different satellite video datasets with four categories of targets from different scenarios. Furthermore, to investigate the generalizability of the TabCtNet framework, satellite videos sourced from different countries and captured by various satellite platforms were used for evaluation, and the results show robust performance across these scenarios.

Keywords