Journal of King Saud University: Computer and Information Sciences (Feb 2024)
Cross-scale Vision Transformer for crowd localization
Abstract
Crowd localization provides both the positions of individuals and the total number of people, making it highly valuable for security monitoring and public management; however, it faces challenges such as lighting variation, occlusion, and perspective distortion. Recently, Transformers have been applied to crowd localization to overcome these challenges. However, such methods integrate multi-scale information only once, resulting in incomplete multi-scale fusion. In this paper, we propose a novel Transformer network named Cross-scale Vision Transformer (CsViT) for crowd localization, which fuses multi-scale information in both the encoder and decoder stages while building long-range context dependencies on the combined feature maps. To this end, we design a multi-scale encoder that fuses feature maps of multiple scales at corresponding positions to obtain combined feature maps, and a multi-scale decoder that integrates tokens at multiple scales when modeling long-range context dependencies. Furthermore, we propose a Multi-scale SSIM (MsSSIM) loss that adaptively computes head regions and optimizes similarity at multiple scales. Specifically, we set adaptive windows of different scales for each head and compute the loss within these windows, thereby improving the accuracy of the predicted distance transform map. Comprehensive experiments on five public datasets validate the effectiveness of our method.
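To make the MsSSIM loss concrete, the following is a minimal sketch in PyTorch. It computes SSIM within multi-scale windows centered at each annotated head on the predicted and ground-truth distance transform maps, and returns one minus the mean similarity. The function names (ssim, msssim_loss), the SSIM constants, and the window half-widths in scales are illustrative assumptions; for simplicity the window sizes here are fixed per scale rather than adapted to each head's size as described in the paper.

```python
import torch

def ssim(x: torch.Tensor, y: torch.Tensor,
         c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """Standard SSIM between two same-sized patches (higher = more similar)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x = x.var(unbiased=False)
    var_y = y.var(unbiased=False)
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def msssim_loss(pred: torch.Tensor, gt: torch.Tensor, heads: torch.Tensor,
                scales=(3, 7, 15)) -> torch.Tensor:
    """1 - mean SSIM over multi-scale windows centered at each head.

    pred, gt : (H, W) predicted / ground-truth distance transform maps.
    heads    : (N, 2) integer (row, col) head coordinates.
    scales   : window half-widths; a stand-in for the paper's adaptive scales.
    """
    h, w = gt.shape
    sims = []
    for r, c in heads.tolist():
        for s in scales:
            # Clamp each window to the image borders before cropping.
            r0, r1 = max(r - s, 0), min(r + s + 1, h)
            c0, c1 = max(c - s, 0), min(c + s + 1, w)
            sims.append(ssim(pred[r0:r1, c0:c1], gt[r0:r1, c0:c1]))
    return 1.0 - torch.stack(sims).mean()

# Usage on toy data: two 64x64 maps and three head positions.
pred = torch.rand(64, 64, requires_grad=True)
gt = torch.rand(64, 64)
heads = torch.tensor([[10, 12], [30, 40], [50, 20]])
loss = msssim_loss(pred, gt, heads)
loss.backward()
```

Because each head contributes windows at several scales, the loss jointly penalizes fine local errors near the head center and coarser structural errors in its neighborhood, which is the intuition behind optimizing similarity at multiple scales.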