Computational Visual Media (Apr 2023)
DTCC: Multi-level dilated convolution with transformer for weakly-supervised crowd counting
Abstract
Crowd counting provides an important foundation for public security and urban management. It is a challenging task because crowd images contain small targets and large density variations. Mainstream methods usually apply convolutional neural networks (CNNs) to regress a density map, which requires annotations of individual persons as well as counts. Weakly-supervised methods avoid such detailed labeling and require only image-level counts as annotations, but existing methods fail to achieve satisfactory performance because a global receptive field and multi-level information are usually ignored. We propose a weakly-supervised method, DTCC, which effectively combines multi-level dilated convolution with a transformer to realize end-to-end crowd counting. Its main components are a recursive Swin transformer and a multi-level dilated convolution regression head. The recursive Swin transformer combines a pyramid vision transformer with a fine-tuned recursive pyramid structure to capture deep multi-level crowd features, including global features. The multi-level dilated convolution regression head applies multi-level dilated convolutions and a linear regression head to the extracted features; it captures low- and high-level features simultaneously and enlarges the receptive field. In addition, two regression-head fusion mechanisms realize dynamic and mean fusion counting. Experiments on four well-known benchmark crowd counting datasets (UCF_CC_50, ShanghaiTech, UCF_QNRF, and JHU-Crowd++) show that DTCC achieves results superior to other weakly-supervised methods and comparable to fully-supervised methods.
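To make the regression head concrete, the following is a minimal PyTorch sketch of a multi-level dilated convolution regression head of the kind the abstract describes: parallel dilated branches enlarge the receptive field, and a pooled linear head regresses a scalar count so that only count-level (weak) supervision is needed. The class name, channel sizes, and dilation rates are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class DilatedRegressionHead(nn.Module):
    """Hypothetical multi-level dilated convolution regression head.

    Sketch only: channel sizes and dilation rates are assumptions,
    not the values used in DTCC.
    """
    def __init__(self, in_channels=768, branch_channels=128,
                 dilations=(1, 2, 3)):
        super().__init__()
        # Parallel 3x3 branches with increasing dilation enlarge the
        # receptive field while preserving spatial resolution.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                          padding=d, dilation=d),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)
        # A linear head maps the pooled multi-level features to a scalar
        # count, so only image-level count annotations are required.
        self.fc = nn.Linear(branch_channels * len(dilations), 1)

    def forward(self, x):
        # Concatenate the multi-level features along the channel axis.
        feats = torch.cat([branch(x) for branch in self.branches], dim=1)
        pooled = self.pool(feats).flatten(1)
        return self.fc(pooled)

# Usage: features from a transformer backbone (batch 2, 768 ch, 24x24).
head = DilatedRegressionHead()
counts = head(torch.randn(2, 768, 24, 24))  # shape: (2, 1)
```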
Keywords