IET Computer Vision (Feb 2024)

STFT: Spatial and temporal feature fusion for transformer tracker

  • Hao Zhang
  • Yan Piao
  • Nan Qi

DOI: https://doi.org/10.1049/cvi2.12233
Journal volume & issue: Vol. 18, no. 1, pp. 165–176

Abstract

Siamese‐based trackers have demonstrated robust performance in object tracking, while Transformers have achieved widespread success in object detection. Currently, many researchers use a hybrid structure of convolutional neural networks (CNNs) and Transformers to design the backbone network of trackers, aiming to improve performance. However, this approach often underutilises the global feature extraction capability of Transformers. The authors propose a novel Transformer‐based tracker that fuses spatial and temporal features. The tracker consists of a multilayer spatial feature fusion network (MSFFN), a temporal feature fusion network (TFFN), and a prediction head. The MSFFN comprises two phases, feature extraction and feature fusion, both constructed with Transformers. Compared with the hybrid “CNNs + Transformer” structure, the proposed method enhances the continuity of feature extraction and the information interaction between features, enabling more comprehensive feature extraction. Moreover, to exploit the temporal dimension, the authors propose a TFFN for updating the template image. The network uses a Transformer to fuse the tracking results of multiple frames with the initial frame, allowing the template image to continuously incorporate new information and maintain accurate target features. Extensive experiments show that the STFT tracker achieves state‐of‐the‐art results on multiple benchmarks (OTB100, VOT2018, LaSOT, GOT‐10K, and UAV123). In particular, STFT achieves remarkable area under the curve (AUC) scores of 0.652 and 0.706 on the LaSOT and OTB100 benchmarks, respectively.

Keywords