DPT‐tracker: Dual pooling transformer for efficient visual tracking

Yang Fang; Bailian Xie; Uswah Khairuddin; Zijian Min; Bingbing Jiang; Weisheng Li

doi:10.1049/cit2.12296

CAAI Transactions on Intelligence Technology (Aug 2024)

Affiliations

Yang Fang: Key Laboratory of Data Engineering and Visual Computing Chongqing University of Posts and Telecommunications Chongqing China
Bailian Xie: Key Laboratory of Data Engineering and Visual Computing Chongqing University of Posts and Telecommunications Chongqing China
Uswah Khairuddin: Department of Mechanical Precision Engineering Malaysia‐Japan International Institute of Technology University of Technology Malaysia Kuala Lumpur Malaysia
Zijian Min: Department of Electrical and Computer Engineering Inha University Incheon Republic of Korea
Bingbing Jiang: School of Information Science and Technology Hangzhou Normal University Hangzhou China
Weisheng Li: Key Laboratory of Data Engineering and Visual Computing Chongqing University of Posts and Telecommunications Chongqing China

DOI: https://doi.org/10.1049/cit2.12296
Journal volume & issue: Vol. 9, no. 4
pp. 948 – 959

Abstract

Transformer trackers typically take paired template and search images as encoder input and perform feature extraction and target‐search feature correlation through self‐ and/or cross‐attention operations, so model complexity grows quadratically with the number of input images. To alleviate the burden of this tracking paradigm and facilitate practical deployment of Transformer‐based trackers, we propose a dual pooling transformer tracking framework, dubbed DPT, which consists of three components: a simple yet efficient spatiotemporal attention model (SAM), a mutual correlation pooling Transformer (MCPT), and a multiscale aggregation pooling Transformer (MAPT). SAM is designed to aggregate the temporal dynamics and spatial appearance information of multi‐frame templates along the space‐time dimensions. MCPT captures multi‐scale pooled and correlated contextual features, and is followed by MAPT, which aggregates the multi‐scale features into a unified feature representation for tracking prediction. The DPT tracker achieves an AUC score of 69.5 on LaSOT and a precision score of 82.8 on TrackingNet while using a shorter attention‐token sequence, fewer parameters, and fewer FLOPs than existing state‐of‐the‐art (SOTA) Transformer tracking methods. Extensive experiments demonstrate that the DPT tracker provides a strong real‐time tracking baseline with a good trade‐off between tracking performance and inference efficiency.
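
To illustrate why pooling shortens the attention-token sequence, the following is a minimal NumPy sketch of generic pooled attention. It is not the authors' MCPT or MAPT implementation: the function names, the pooling ratios (2 and 4), the 16×16×32 feature map, and the omission of learned query/key/value projections are all assumptions made for illustration only.

```python
import numpy as np

def avg_pool_tokens(feat, ratio):
    """Average-pool an (H, W, C) feature map with a square window of size
    `ratio` and return the result as a (tokens, C) matrix."""
    h, w, c = feat.shape
    h2, w2 = h // ratio, w // ratio
    cropped = feat[:h2 * ratio, :w2 * ratio]          # drop any remainder rows/cols
    pooled = cropped.reshape(h2, ratio, w2, ratio, c).mean(axis=(1, 3))
    return pooled.reshape(h2 * w2, c)

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pooled_attention(feat, ratios=(2, 4)):
    """Hypothetical pooled attention: queries are all H*W tokens, while
    keys/values are the concatenation of tokens pooled at several scales.
    Learned projections are omitted (assumption, to keep the sketch short)."""
    h, w, c = feat.shape
    q = feat.reshape(h * w, c)
    kv = np.concatenate([avg_pool_tokens(feat, r) for r in ratios], axis=0)
    scores = q @ kv.T / np.sqrt(c)                    # (H*W, pooled tokens)
    return softmax(scores) @ kv, scores.shape

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 16, 32))              # assumed toy feature map
out, score_shape = pooled_attention(feat)

full_scores = (16 * 16) ** 2                          # unpooled self-attention entries
pooled_scores = score_shape[0] * score_shape[1]       # entries with pooled keys/values
print(out.shape, score_shape, full_scores, pooled_scores)
```

Under these assumptions the 256 query tokens attend to 64 + 16 = 80 pooled key/value tokens, so the score matrix has 256 × 80 = 20,480 entries instead of the 256 × 256 = 65,536 needed for unpooled self-attention, while the output keeps one feature vector per query token.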

Published in CAAI Transactions on Intelligence Technology

ISSN: 2468-2322 (Online)
Publisher: Wiley
Country of publisher: United Kingdom
LCC subjects: Language and Literature: Philology. Linguistics: Computational linguistics. Natural language processing; Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software
Website: https://ietresearch.onlinelibrary.wiley.com/journal/24682322
