IEEE Access (Jan 2023)

Transformer-Based Feature Aggregation and Stitching Network for Crowd Counting

  • Kehao Wang,
  • Yuhui Wang,
  • Ruiqi Ren,
  • Han Zou,
  • Zhichao Shao

DOI
https://doi.org/10.1109/ACCESS.2023.3329985
Journal volume & issue
Vol. 11
pp. 124833–124844

Abstract

With the rapid development of society, crowded scenes can be seen almost everywhere. It is therefore important to accurately predict the number and density distribution of people in those crowded regions through image analysis. In recent years, most deep learning studies of pedestrian image analysis have been built on mature convolutional neural networks (CNNs). More recently, the vision transformer has demonstrated performance competitive with CNNs in many computer vision domains, offering a novel approach to density estimation in images. In this paper, we modify the Swin Transformer and integrate CNN components to propose a feature aggregation and stitching network (FASNet), which effectively improves counting accuracy. The hierarchical vision transformer backbone captures global multi-scale features of the image and encodes the interaction information among different pedestrians in the deep layers of the network. A Feature Aggregation Module (FAM) fuses the deep and shallow features, and a Density Regression Module (DRM) upsamples the output of the FAM to produce the predicted crowd density map and the final count. In addition, we propose a Feature Stitching Mechanism (FSM) to cope with the feature damage or loss caused by image cropping during model testing. Experimental results on three benchmark datasets (UCF_CC_50, UCF-QNRF, ShanghaiTech) demonstrate the effectiveness of the proposed scheme.
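The abstract outlines a concrete data flow: a hierarchical backbone yields multi-scale features, the FAM fuses deep and shallow maps, and the DRM upsamples the result to a density map whose spatial sum is the count. Below is a minimal PyTorch sketch of that flow only. The strided-convolution backbone stands in for the paper's modified Swin Transformer, and every module body, layer shape, and channel count is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FAM(nn.Module):
    """Feature Aggregation Module (sketch): fuse a deep and a shallow feature map."""
    def __init__(self, deep_ch, shallow_ch, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(deep_ch + shallow_ch, out_ch, kernel_size=1)

    def forward(self, deep, shallow):
        # Upsample deep features to the shallow resolution, then fuse by concat + 1x1 conv.
        deep = F.interpolate(deep, size=shallow.shape[-2:], mode="bilinear",
                             align_corners=False)
        return self.fuse(torch.cat([deep, shallow], dim=1))

class DRM(nn.Module):
    """Density Regression Module (sketch): upsample fused features to a density map."""
    def __init__(self, in_ch):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1), nn.ReLU(inplace=True))  # density >= 0

    def forward(self, x, out_size):
        x = F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)
        return self.head(x)

class FASNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in hierarchical backbone; the paper uses a modified Swin Transformer.
        self.stage1 = nn.Sequential(nn.Conv2d(3, 96, 4, stride=4), nn.ReLU(True))
        self.stage2 = nn.Sequential(nn.Conv2d(96, 192, 2, stride=2), nn.ReLU(True))
        self.fam = FAM(deep_ch=192, shallow_ch=96, out_ch=128)
        self.drm = DRM(128)

    def forward(self, img):
        shallow = self.stage1(img)           # 1/4-resolution features
        deep = self.stage2(shallow)          # 1/8-resolution features
        fused = self.fam(deep, shallow)
        density = self.drm(fused, img.shape[-2:])
        count = density.sum(dim=(1, 2, 3))   # predicted head count per image
        return density, count

x = torch.randn(1, 3, 256, 256)
density, count = FASNetSketch()(x)
print(density.shape, count.item())
```

The sum-over-density readout reflects the standard density-map formulation of crowd counting; the FSM described in the abstract (stitching features across test-time crops) would sit around this forward pass at inference and is not sketched here.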

Keywords