Virtual Reality & Intelligent Hardware (Apr 2023)

MSSTNet: Multi-scale facial videos pulse extraction network based on separable spatiotemporal convolution and dimension separable attention

  • Changchen Zhao
  • Hongsheng Wang
  • Yuanjing Feng

Journal volume & issue
Vol. 5, no. 2
pp. 124–141

Abstract

Background: Using remote photoplethysmography (rPPG) to estimate the blood volume pulse in a non-contact way has been an active research topic in recent years. Existing methods are mainly based on a single-scale region of interest (ROI). However, some noise signals that are not easily separated in single-scale space can be easily separated in multi-scale space. In addition, existing spatiotemporal networks mainly focus on local spatiotemporal information and place little emphasis on temporal information, which is crucial for pulse extraction, resulting in insufficient spatiotemporal feature modeling. Methods: This paper proposes a multi-scale facial video pulse extraction network based on separable spatiotemporal convolution and dimension separable attention. First, to address the limitation of a single-scale ROI, we construct a multi-scale feature space for initial signal separation. Second, separable spatiotemporal convolution and dimension separable attention are designed for efficient spatiotemporal correlation modeling; they increase the information interaction between long-span temporal and spatial dimensions and place greater emphasis on temporal features. Results: The signal-to-noise ratio (SNR) of the proposed network reaches 9.58 dB on the PURE dataset and 6.77 dB on the UBFC-rPPG dataset, outperforming state-of-the-art algorithms. Conclusions: The results show that fusing multi-scale signals generally yields better results than methods based on a single-scale signal alone. The proposed separable spatiotemporal convolution and dimension separable attention mechanisms contribute to more accurate pulse signal extraction.
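
For readers who want a concrete picture of the two building blocks named in the abstract, the following is a minimal PyTorch sketch. The class names, channel sizes, kernel shapes, and the gating-style formulation of the attention are illustrative assumptions made for this sketch, not the authors' published MSSTNet implementation.

```python
# Hypothetical sketch: separable spatiotemporal convolution factorizes a 3D
# convolution into a spatial 2D step followed by a temporal 1D step, so the
# temporal axis is modeled explicitly; dimension separable attention gates
# the temporal and spatial dimensions independently. All design choices here
# are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


class SeparableSpatiotemporalConv(nn.Module):
    """Spatial (1, k, k) convolution followed by temporal (k, 1, 1) convolution."""

    def __init__(self, in_ch, out_ch, spatial_k=3, temporal_k=3):
        super().__init__()
        # Spatial step: kernel touches only H and W.
        self.spatial = nn.Conv3d(
            in_ch, out_ch,
            kernel_size=(1, spatial_k, spatial_k),
            padding=(0, spatial_k // 2, spatial_k // 2),
        )
        # Temporal step: kernel touches only T, keeping temporal modeling
        # separate from spatial modeling.
        self.temporal = nn.Conv3d(
            out_ch, out_ch,
            kernel_size=(temporal_k, 1, 1),
            padding=(temporal_k // 2, 0, 0),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.act(self.temporal(self.act(self.spatial(x))))


class DimensionSeparableAttention(nn.Module):
    """Attention computed per dimension: a temporal gate over T and a
    spatial gate over (H, W), applied multiplicatively."""

    def __init__(self, channels):
        super().__init__()
        self.temporal_gate = nn.Sequential(
            nn.AdaptiveAvgPool3d((None, 1, 1)),    # pool space -> (B, C, T, 1, 1)
            nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(
            nn.AdaptiveAvgPool3d((1, None, None)),  # pool time -> (B, C, 1, H, W)
            nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.Sigmoid(),
        )

    def forward(self, x):  # x: (B, C, T, H, W)
        return x * self.temporal_gate(x) * self.spatial_gate(x)


if __name__ == "__main__":
    # A toy clip: batch 2, 3 channels, 32 frames, 64x64 face crops.
    clip = torch.randn(2, 3, 32, 64, 64)
    feats = SeparableSpatiotemporalConv(3, 16)(clip)
    out = DimensionSeparableAttention(16)(feats)
    print(out.shape)  # torch.Size([2, 16, 32, 64, 64])
```

The factorized convolution is what lets the network weight temporal correlations on their own terms, and the per-dimension gates are one simple way to realize the "more emphasis on temporal features" described in the abstract.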

Keywords