IET Computer Vision (Feb 2021)

Multi‐level feature fusion network for crowd counting

  • Luyang Wang,
  • Yun Li,
  • Sifan Peng,
  • Xiao Tang,
  • Baoqun Yin

DOI
https://doi.org/10.1049/cvi2.12012
Journal volume & issue
Vol. 15, no. 1
pp. 60 – 72

Abstract

Crowd counting has become a noteworthy vision task due to the needs of numerous practical applications, but it remains challenging. State‐of‐the‐art methods generally estimate the density map of the crowd image from the high‐level semantic features of various deep convolutional networks. However, the absence of low‐level spatial information may result in counting errors in the local details of the density map. To this end, a novel framework named Multi‐level Feature Fusion Network (MFFN) for single‐image crowd counting is proposed. The proposed MFFN, which is constructed in an encoder–decoder fashion, incorporates semantic and spatial information to generate high‐resolution density maps of input crowd images. Skip connections are developed between the encoder and the decoder so that low‐level spatial information and high‐level semantic features can be combined by element‐wise addition. In addition, a dense dilated convolution block is placed after the encoder, extracting multi‐scale context features to guide feature fusion through a channel attention mechanism. The model is trained by multi‐task learning; semantic segmentation supervision is introduced to enhance feature representation. Extensive experiments are conducted on three crowd counting datasets (ShanghaiTech, UCF_CC_50, UCF‐QNRF), and the results show that MFFN outperforms state‐of‐the‐art methods. In addition, ablation studies are performed to verify the effectiveness of each component of the proposed method.
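
The fusion step described in the abstract (low‐level and high‐level features combined by element‐wise addition, guided by channel attention computed from context features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `channel_attention` and `fuse`, the use of global average pooling followed by a sigmoid, and the choice to gate only the low‐level branch are all assumptions made for illustration, and the learned layers of a real network are omitted.

```python
import numpy as np

def channel_attention(context):
    """Reduce a (C, H, W) context tensor to per-channel weights in (0, 1).

    Assumption: global average pooling plus a sigmoid, with no learned
    layers. The abstract does not specify the attention design.
    """
    pooled = context.mean(axis=(1, 2))        # global average pooling -> (C,)
    return 1.0 / (1.0 + np.exp(-pooled))      # sigmoid gate -> (C,)

def fuse(low_level, high_level, context):
    """Fuse two (C, H, W) feature maps by element-wise addition.

    Assumption: the attention weights rescale the low-level (skip) branch
    before it is added to the high-level (decoder) branch.
    """
    if low_level.shape != high_level.shape:
        raise ValueError("feature maps must share a shape for element-wise addition")
    if context.shape[0] != low_level.shape[0]:
        raise ValueError("context must have one channel per feature channel")
    weights = channel_attention(context)[:, None, None]  # (C, 1, 1), broadcast over H, W
    return weights * low_level + high_level

# Toy usage with random tensors standing in for real feature maps.
rng = np.random.default_rng(0)
low = rng.standard_normal((4, 8, 8))      # hypothetical skip-connection features
high = rng.standard_normal((4, 8, 8))     # hypothetical decoder features
ctx = rng.standard_normal((4, 2, 2))      # hypothetical multi-scale context features
fused = fuse(low, high, ctx)              # shape (4, 8, 8)
```

In this sketch the context tensor may have a different spatial size from the feature maps, because it is pooled to one value per channel before it is used.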