End‐to‐end feature fusion Siamese network for adaptive visual tracking

Dongyan Guo; Jun Wang; Weixuan Zhao; Ying Cui; Zhenhua Wang; Shengyong Chen

doi:10.1049/ipr2.12009

IET Image Processing (Jan 2021)

Affiliations

Dongyan Guo: College of Computer Science and Technology Zhejiang University of Technology Hangzhou Zhejiang China
Jun Wang: College of Computer Science and Technology Zhejiang University of Technology Hangzhou Zhejiang China
Weixuan Zhao: College of Computer Science and Technology Zhejiang University of Technology Hangzhou Zhejiang China
Ying Cui: College of Computer Science and Technology Zhejiang University of Technology Hangzhou Zhejiang China
Zhenhua Wang: College of Computer Science and Technology Zhejiang University of Technology Hangzhou Zhejiang China
Shengyong Chen: School of Computer Science and Engineering Tianjin University of Technology Tianjin China

DOI: https://doi.org/10.1049/ipr2.12009
Journal volume & issue: Vol. 15, no. 1
pp. 91–100

Abstract

Observations show that different visual objects have different salient features in different scenarios. Even for the same object, the salient shape and appearance features may change greatly over the course of a long‐term tracking task. Motivated by these observations, an end‐to‐end feature fusion framework based on the Siamese network, named FF‐Siam, is proposed to fuse different features effectively for adaptive visual tracking. The framework consists of four layers. A feature extraction layer extracts different features of the target region and the search region. The extracted features are then fed into a weight generation layer to obtain channel weights, which indicate the importance of the different feature channels. Both the features and the channel weights are utilised in a template generation layer to generate a discriminative template. Finally, the response maps produced by convolving the search region features with the template are passed to a fusion layer to obtain the final response map used to locate the target. Experimental results demonstrate that the proposed framework achieves state‐of‐the‐art performance on the popular Temple‐Colour, OTB50 and UAV123 benchmarks.
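The four-layer data flow described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say which features are extracted, how the channel weights are computed, or how the response maps are fused. The raw-intensity and gradient features, the softmax-over-energy weighting, the averaging fusion, and all function names are assumptions chosen only to make the pipeline concrete. In the actual framework these stages are presumably learned end to end.

```python
import numpy as np

def extract_features(patch):
    """Feature extraction layer (assumed stand-in): returns two feature
    types, each shaped (channels, H, W)."""
    gy, gx = np.gradient(patch)          # derivatives along rows, columns
    appearance = patch[None, :, :]       # 1 channel: raw intensity
    shape = np.stack([gx, gy])           # 2 channels: signed gradients
    return [appearance, shape]

def channel_weights(feat):
    """Weight generation layer (assumed): softmax over the mean energy
    of each channel, so the weights are positive and sum to 1."""
    energy = (feat ** 2).mean(axis=(1, 2))
    e = np.exp(energy - energy.max())
    return e / e.sum()

def make_template(feat, weights):
    """Template generation layer (assumed): scale each channel of the
    target features by its weight."""
    return feat * weights[:, None, None]

def correlate(search_feat, template):
    """Slide the template over the search features ('valid' positions
    only) and sum the products over all channels."""
    _, h, w = template.shape
    _, H, W = search_feat.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(search_feat[:, i:i + h, j:j + w] * template)
    return out

def fuse(responses):
    """Fusion layer (assumed): average the min-max-normalised maps."""
    normed = [(r - r.min()) / (r.max() - r.min() + 1e-12) for r in responses]
    return np.mean(normed, axis=0)

def track(target_patch, search_patch):
    """Run the four stages and return the peak location (row, col) of the
    fused response map together with the map itself."""
    templates = [make_template(f, channel_weights(f))
                 for f in extract_features(target_patch)]
    responses = [correlate(s, t)
                 for s, t in zip(extract_features(search_patch), templates)]
    fused = fuse(responses)
    return np.unravel_index(np.argmax(fused), fused.shape), fused
```

Calling `track(target_patch, search_patch)` with two 2-D arrays, the target no larger than the search region, returns the top-left offset at which the template best matches. Because the correlation here is unnormalised, the sketch is only meaningful for roughly zero-mean inputs. A practical tracker would use learned features and a trained weighting and fusion stage instead.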

Published in IET Image Processing

ISSN: 1751-9659 (Print); 1751-9667 (Online)
Publisher: Wiley
Country of publisher: United Kingdom
LCC subjects: Technology: Photography; Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software
Website: https://ietresearch.onlinelibrary.wiley.com/journal/17519667
