One-Shot Multiple Object Tracking in UAV Videos Using Task-Specific Fine-Grained Features

Han Wu; Jiahao Nie; Zhiwei He; Ziming Zhu; Mingyu Gao

doi:10.3390/rs14163853

Remote Sensing (Aug 2022)

Affiliations

Han Wu: The School of Electronic Information, Hangzhou Dianzi University, Hangzhou 310018, China
Jiahao Nie: The School of Electronic Information, Hangzhou Dianzi University, Hangzhou 310018, China
Zhiwei He: The School of Electronic Information, Hangzhou Dianzi University, Hangzhou 310018, China
Ziming Zhu: The School of Electronic Information, Hangzhou Dianzi University, Hangzhou 310018, China
Mingyu Gao: The School of Electronic Information, Hangzhou Dianzi University, Hangzhou 310018, China

DOI: https://doi.org/10.3390/rs14163853
Journal volume & issue: Vol. 14, no. 16, p. 3853

Abstract

Multiple object tracking (MOT) in unmanned aerial vehicle (UAV) videos is a fundamental task and can be applied in many fields. MOT consists of two critical procedures, i.e., object detection and re-identification (ReID). One-shot MOT, which incorporates detection and ReID in a unified network, has gained attention due to its fast inference speed. It significantly reduces the computational overhead by making two subtasks share features. However, most existing one-shot trackers struggle to achieve robust tracking in UAV videos. We observe that the essential difference between detection and ReID leads to an optimization contradiction within one-shot networks. To alleviate this contradiction, we propose a novel feature decoupling network (FDN) to convert shared features into detection-specific and ReID-specific representations. The FDN searches for characteristics and commonalities between the two tasks to synergize detection and ReID. In addition, existing one-shot trackers struggle to locate small targets in UAV videos. Therefore, we design a pyramid transformer encoder (PTE) to enrich the semantic information of the resulting detection-specific representations. By learning scale-aware fine-grained features, the PTE empowers our tracker to locate targets in UAV videos accurately. Extensive experiments on VisDrone2021 and UAVDT benchmarks demonstrate that our tracker achieves state-of-the-art tracking performance.

Published in Remote Sensing

ISSN: 2072-4292 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science
Website: http://www.mdpi.com/journal/remotesensing/
