Multi‐dimensional weighted cross‐attention network in crowded scenes

Yefan Xie; Jiangbin Zheng; Xuan Hou; Irfan Raza Naqvi; Yue Xi; Nailiang Kuang

doi:10.1049/ipr2.12298

IET Image Processing (Dec 2021)

Affiliations

Yefan Xie: School of Computer Science and Engineering Northwestern Polytechnical University Xi'an PR China
Jiangbin Zheng: School of Computer Science and Engineering Northwestern Polytechnical University Xi'an PR China
Xuan Hou: School of Computer Science and Engineering Northwestern Polytechnical University Xi'an PR China
Irfan Raza Naqvi: School of Software Northwestern Polytechnical University Xi'an PR China
Yue Xi: Aeronautics Engineering College Air Force Engineering University of PLA Xi'an China
Nailiang Kuang: Xi'an Microelectronics Technology Institute Xi'an PR China

DOI: https://doi.org/10.1049/ipr2.12298
Journal volume & issue: Vol. 15, no. 14
pp. 3585 – 3598

Abstract

Human detection in crowded scenes is one component of crowd safety analysis, with applications such as emergency warning and security monitoring platforms. Although existing anchor‐free methods offer fast inference, they are poorly suited to object detection in crowded scenes because they cannot predict well‐refined bounding boxes. This work proposes an end‐to‐end anchor‐free network, the Multi‐dimensional Weighted Cross‐Attention Network (MANet), which performs real‐time human detection in crowded scenes. Specifically, the Double‐flow Weighted Feature Cascade Module (DW‐FCM) is used in the extractor to highlight the contribution of features at different levels. The Triplet Cross Attention Module (TCAM) is used in the detector head to enhance the dependencies among multi‐dimensional features, further strengthening the fine‐grained discriminability of human boundary features. Moreover, an Adaptively Opposite Thrust Mapping (AOTM) ground‐truth annotation strategy is proposed to correct the bias of erroneous mappings and reduce the number of wasted learning iterations. Together, these strategies alleviate the inability of existing anchor‐free networks to distinguish and locate individual humans in crowded scenes. Unlike anchor‐based detection methods, the network requires no manually set anchor parameters, and its detection speed is sufficient for real‐time applications. Extensive comparative experiments on the CrowdHuman and WIDER FACE datasets demonstrate that the proposed approach achieves state‐of‐the‐art results among anchor‐free methods.

Published in IET Image Processing

ISSN: 1751-9659 (Print); 1751-9667 (Online)
Publisher: Wiley
Country of publisher: United Kingdom
LCC subjects: Technology: Photography; Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software
Website: https://ietresearch.onlinelibrary.wiley.com/journal/17519667
