Multi‐level feature fusion network for crowd counting

Luyang Wang; Yun Li; Sifan Peng; Xiao Tang; Baoqun Yin

doi:10.1049/cvi2.12012

IET Computer Vision (Feb 2021)

Affiliations

Luyang Wang: Department of Automation, University of Science and Technology of China, Hefei, China
Yun Li: Department of Automation, University of Science and Technology of China, Hefei, China
Sifan Peng: Department of Automation, University of Science and Technology of China, Hefei, China
Xiao Tang: Department of Electronic Science and Technology, University of Science and Technology of China, Hefei, China
Baoqun Yin: Department of Automation, University of Science and Technology of China, Hefei, China

DOI: https://doi.org/10.1049/cvi2.12012
Journal volume & issue: Vol. 15, no. 1
pp. 60 – 72

Abstract

Crowd counting has become a noteworthy vision task due to the needs of numerous practical applications, but it remains challenging. State‐of‐the‐art methods generally estimate the density map of a crowd image from the high‐level semantic features of deep convolutional networks. However, the absence of low‐level spatial information may cause counting errors in the local details of the density map. To address this, a novel framework named Multi‐level Feature Fusion Network (MFFN) for single‐image crowd counting is proposed. MFFN, constructed in an encoder–decoder fashion, incorporates semantic and spatial information to generate high‐resolution density maps of input crowd images. Skip connections between the encoder and the decoder combine low‐level spatial information and high‐level semantic features by element‐wise addition. In addition, a dense dilated convolution block is placed after the encoder to extract multi‐scale context features, which guide feature fusion through a channel attention mechanism. The model is trained by multi‐task learning, with semantic segmentation supervision introduced to enhance feature representation. Extensive experiments are conducted on three crowd counting datasets (ShanghaiTech, UCF_CC_50, UCF‐QNRF), and the results show that MFFN outperforms state‐of‐the‐art methods. Ablation studies are also performed to verify the effectiveness of each component of the proposed method.
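
The fusion step described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. It assumes that the channel attention is a sigmoid gate computed from the global average of each context channel, and that this gate scales the low-level features before the element-wise addition. The abstract does not specify either detail, and a trained network would use learned layers. All function names are hypothetical. The last helper reflects the standard convention that the estimated count is the sum of the density map.

```python
import math


def global_avg_pool(feature):
    """Return the mean of each channel; feature is a [C][H][W] nested list."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def channel_attention(context):
    """Assumed gating: one sigmoid weight in (0, 1) per context channel."""
    return [1.0 / (1.0 + math.exp(-m)) for m in global_avg_pool(context)]


def fuse(low, high, context):
    """Scale low-level channels by the gates, then add high-level features
    element-wise (the skip-connection fusion named in the abstract)."""
    gates = channel_attention(context)
    return [
        [[g * lo + hi for lo, hi in zip(lo_row, hi_row)]
         for lo_row, hi_row in zip(lo_ch, hi_ch)]
        for g, lo_ch, hi_ch in zip(gates, low, high)
    ]


def count_from_density(density):
    """Estimated crowd count: the sum over a single-channel [H][W] density map."""
    return sum(sum(row) for row in density)


# Toy example: one channel, 2x2 maps; zero context gives a gate of 0.5.
low = [[[1.0, 2.0], [3.0, 4.0]]]
high = [[[0.5, 0.5], [0.5, 0.5]]]
context = [[[0.0, 0.0], [0.0, 0.0]]]
fused = fuse(low, high, context)
```

Because the addition is element-wise, the low-level and high-level feature maps must share the same shape, which is why an encoder-decoder design restores resolution in the decoder before each skip connection is applied.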

Published in IET Computer Vision

ISSN: 1751-9632 (Print); 1751-9640 (Online)
Publisher: Wiley
Country of publisher: United Kingdom
LCC subjects: Medicine: Medicine (General): Computer applications to medicine. Medical informatics; Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software
Website: https://ietresearch.onlinelibrary.wiley.com/journal/17519640
