Attention-Guided Disentangled Feature Aggregation for Video Object Detection

Shishir Muralidhara; Khurram Azeem Hashmi; Alain Pagani; Marcus Liwicki; Didier Stricker; Muhammad Zeshan Afzal

doi:10.3390/s22218583

Sensors (Nov 2022)

Affiliations

Shishir Muralidhara: Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
Khurram Azeem Hashmi: Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
Alain Pagani: German Research Institute for Artificial Intelligence (DFKI), 67663 Kaiserslautern, Germany
Marcus Liwicki: Department of Computer Science, Luleå University of Technology, 971 87 Luleå, Sweden
Didier Stricker: Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany
Muhammad Zeshan Afzal: Department of Computer Science, Technical University of Kaiserslautern, 67663 Kaiserslautern, Germany

DOI: https://doi.org/10.3390/s22218583
Journal volume & issue: Vol. 22, no. 21, p. 8583

Abstract

Object detection is a computer vision task that involves localising and classifying objects in an image. Video data introduces additional challenges, such as blur, occlusion and defocus, which make video object detection harder than still-image object detection, where each image is processed independently. This paper tackles these challenges by proposing an attention-heavy framework for video object detection that aggregates the disentangled features extracted from individual frames. The proposed framework is a two-stage object detector based on the Faster R-CNN architecture. The disentanglement head integrates scale-, spatial- and task-aware attention and applies it to the features extracted by the backbone network across all frames. Subsequently, the aggregation head incorporates temporal attention and improves detection in the target frame by aggregating the features of the support frames. These features include the features extracted by the disentanglement network along with the temporal features. We evaluate the proposed framework on the ImageNet VID dataset and achieve a mean Average Precision (mAP) of 49.8 and 52.5 with ResNet-50 and ResNet-101 backbones, respectively. The improvement in performance over the individual baseline methods validates the efficacy of the proposed approach.
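
The aggregation step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how temporal attention is computed, so the code below assumes cosine similarity between the target-frame feature and each support-frame feature, a softmax over those similarities, and a residual addition of the weighted support features. The function names (cosine, softmax, aggregate) are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (small epsilon avoids division by zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(target, supports):
    """Aggregate support-frame features into the target-frame feature.

    Assumption (not taken from the paper): attention weights are a softmax
    over cosine similarities, and the weighted sum of support features is
    added to the target feature as a residual.
    """
    weights = softmax([cosine(target, s) for s in supports])
    weighted_sum = [
        sum(w * s[i] for w, s in zip(weights, supports))
        for i in range(len(target))
    ]
    aggregated = [t + a for t, a in zip(target, weighted_sum)]
    return aggregated, weights


if __name__ == "__main__":
    target_feature = [1.0, 0.0]
    support_features = [[1.0, 0.0], [0.0, 1.0]]
    aggregated, weights = aggregate(target_feature, support_features)
    print("attention weights:", weights)
    print("aggregated feature:", aggregated)
```

In this toy setup the support feature that points in the same direction as the target receives the larger weight, which is the intuition behind letting well-matched support frames compensate for a blurred or occluded target frame.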

Published in Sensors

ISSN: 1424-8220 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Chemical technology
Website: http://www.mdpi.com/journal/sensors
