IEEE Access (Jan 2021)

AVMSN: An Audio-Visual Two Stream Crowd Counting Framework Under Low-Quality Conditions

  • Ruihan Hu,
  • Qinglong Mo,
  • Yuanfei Xie,
  • Yongqian Xu,
  • Jiaqi Chen,
  • Yalun Yang,
  • Hongjian Zhou,
  • Zhi-Ri Tang,
  • Edmond Q. Wu

DOI
https://doi.org/10.1109/ACCESS.2021.3074797
Journal volume & issue
Vol. 9
pp. 80500–80510

Abstract


Crowd counting is an essential computer vision application that uses convolutional neural networks to model crowd density as a regression task. Vision-only models, however, struggle to extract reliable features under low-quality conditions. Vision and audio are the two media through which people most commonly perceive changes in the physical world, and this cross-modal information offers an alternative route to the crowd counting task. To address this problem, this paper establishes the Audio-Visual Multi-Scale Network (AVMSN), which models unconstrained visual and audio sources to perform crowd counting. The AVMSN is built on a feature-extraction module and a multi-modal fusion module. To handle objects of various sizes in a crowd scene, the feature-extraction module adopts Sample Convolutional Blocks as a multi-scale vision-end branch to compute a weighted visual feature. The audio signal is transformed from the temporal domain into a spectrogram, from which an audio feature is learned by an audio-VGG network. Finally, the multi-modal fusion module fuses the weighted visual and audio features with a cascade fusion architecture to produce the estimated density map. Experimental results show that the proposed AVMSN achieves a lower mean absolute error than other state-of-the-art crowd counting models under low-quality conditions.
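The fusion step described in the abstract can be illustrated with a minimal NumPy sketch. All shapes, names, and the 1x1-projection stand-in below are illustrative assumptions, not the authors' implementation: a global audio embedding is tiled over the spatial grid, concatenated with the visual feature map along the channel axis, and projected to a single-channel density map whose sum gives the count estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes (not from the paper): a visual feature map from a
# multi-scale branch and a global audio embedding from an audio network.
C_v, H, W = 64, 32, 32   # visual channels and spatial size
C_a = 128                # audio embedding size

visual_feat = rng.standard_normal((C_v, H, W))
audio_emb = rng.standard_normal((C_a,))

# Fusion sketch: tile the audio embedding over the spatial grid and
# concatenate it with the visual feature along the channel axis.
audio_map = np.broadcast_to(audio_emb[:, None, None], (C_a, H, W))
fused = np.concatenate([visual_feat, audio_map], axis=0)  # (C_v + C_a, H, W)

# A 1x1 convolution reduces the fused tensor to a one-channel density map;
# here it is simply a learned weighted sum over channels.
w = rng.standard_normal((C_v + C_a,)) / np.sqrt(C_v + C_a)
density_map = np.tensordot(w, fused, axes=(0, 0))          # (H, W)

# The crowd count estimate is the integral (sum) of the density map.
count = density_map.sum()
print(fused.shape, density_map.shape)
```

In a real network the weighted sum would be a trained convolution and the audio embedding would come from a spectrogram fed through a VGG-style encoder, but the broadcast-and-concatenate pattern is the core of channel-wise audio-visual fusion.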

Keywords