IEEE Access (Jan 2020)

Deeper Siamese Network With Stronger Feature Representation for Visual Tracking

  • Chaoyi Zhang,
  • Howard Wang,
  • Jiwei Wen,
  • Li Peng

DOI
https://doi.org/10.1109/ACCESS.2020.3005511
Journal volume & issue
Vol. 8
pp. 119094–119104

Abstract

Siamese network based visual tracking has drawn considerable attention recently due to its balanced accuracy and speed. Methods of this type typically train a relatively shallow twin network offline and locate the object online by measuring similarity, via a cross-correlation operation, between the feature maps that the last convolutional layer generates for the target and search regions. Nevertheless, a single feature map extracted from the last layer of a shallow network is insufficient to describe the target appearance and is sensitive to distractors, which can mislead the similarity response map and cause the tracker to drift. To enhance tracking accuracy and robustness while maintaining real-time speed, this paper makes three improvements to the above tracking paradigm: a reformed backbone network, fusion of hierarchical features, and a channel attention mechanism. Firstly, we introduce a modified, deeper VGG16 backbone network that extracts more powerful features and thus helps distinguish the target from distractors. Secondly, we fuse diverse features extracted from deep and shallow layers to exploit both the semantic and the spatial information of the target. Thirdly, we incorporate a novel lightweight residual channel attention mechanism into the backbone network, which widens the weight gap between channels and helps the network pay more attention to dominant features. Extensive experiments on the OTB100 and VOT2018 benchmarks demonstrate that our tracker outperforms several state-of-the-art methods in accuracy and efficiency in real-time scenarios.
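The abstract names two operations concrete enough to sketch: the cross-correlation that produces the similarity response map, and the residual channel attention that reweights feature channels. The following is a minimal PyTorch sketch of both, assuming a squeeze-and-excitation-style attention block with a residual connection and a standard SiamFC-style grouped-convolution cross-correlation; the module layout, reduction ratio, and feature shapes are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention with a residual
    connection (hypothetical layout; the paper's block may differ)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.weight = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                   # squeeze: global average pool
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                                              # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual form: reweighted features are added back onto the input,
        # widening the gap between dominant and weak channels.
        return x + x * self.weight(x)


def cross_correlation(z: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Slide template features z (B, C, Hz, Wz) over search features
    x (B, C, Hx, Wx); returns a (B, 1, Hx-Hz+1, Wx-Wz+1) response map."""
    b, c, hz, wz = z.shape
    # Grouped convolution applies each sample's template only to its own
    # search region within the batch.
    out = F.conv2d(x.reshape(1, b * c, *x.shape[-2:]), z, groups=b)
    return out.reshape(b, 1, *out.shape[-2:])


if __name__ == "__main__":
    attn = ResidualChannelAttention(256)
    z = attn(torch.randn(2, 256, 6, 6))    # template (exemplar) features
    x = attn(torch.randn(2, 256, 22, 22))  # search-region features
    print(cross_correlation(z, x).shape)   # torch.Size([2, 1, 17, 17])
```

In this sketch the attention block is applied to both branches before correlation, so channels that the attention deems dominant contribute more to the response map; where exactly the paper inserts the block within its VGG16 backbone is not stated in the abstract.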

Keywords