Virtual Reality & Intelligent Hardware (Apr 2023)
MSSTNet: Multi-scale facial videos pulse extraction network based on separable spatiotemporal convolution and dimension separable attention
Abstract
Background: Using remote photoplethysmography (rPPG) to estimate the blood volume pulse without contact has been an active research topic in recent years. Existing methods are mainly based on a single-scale region of interest (ROI). However, some noise signals that are difficult to separate in a single-scale space can be separated easily in a multi-scale space. In addition, existing spatiotemporal networks focus mainly on local spatiotemporal information and place too little emphasis on temporal information, which is crucial in pulse extraction, resulting in insufficient spatiotemporal feature modeling.
Methods: This paper proposes a multi-scale facial video pulse extraction network based on separable spatiotemporal convolution and dimension separable attention. First, to address the limitation of a single-scale ROI, we construct a multi-scale feature space for initial signal separation. Second, separable spatiotemporal convolution and dimension separable attention are designed for efficient spatiotemporal correlation modeling; they increase the information interaction between long-span time and space dimensions and put more emphasis on temporal features.
Results: The signal-to-noise ratio (SNR) of the proposed network reaches 9.58 dB on the PURE dataset and 6.77 dB on the UBFC-rPPG dataset, outperforming state-of-the-art algorithms.
Conclusions: The results show that fusing multi-scale signals generally yields better results than using a single-scale signal alone. The proposed separable spatiotemporal convolution and dimension separable attention mechanism contribute to more accurate pulse signal extraction.
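To make the idea of a multi-scale feature space more concrete, the following is a minimal sketch, not the network described above. It assumes that each scale is obtained by dividing a grayscale ROI frame into a grid of blocks and averaging the pixels in each block, which gives one temporal signal per block per scale. The abstract does not specify how the scales are built, so the grid sizes (1, 2, 4) and the names `block_means` and `multi_scale_signals` are hypothetical choices made only for illustration.

```python
from typing import Dict, List, Sequence

# One grayscale ROI frame: a list of rows, each row a list of pixel intensities.
Frame = List[List[float]]


def block_means(frame: Frame, grid: int) -> List[float]:
    """Split a frame into grid x grid blocks and return the mean of each block.

    Assumes the frame height and width are both at least `grid`,
    so that no block is empty.
    """
    height, width = len(frame), len(frame[0])
    means = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            pixels = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            means.append(sum(pixels) / len(pixels))
    return means


def multi_scale_signals(
    video: List[Frame], grids: Sequence[int] = (1, 2, 4)
) -> Dict[int, List[List[float]]]:
    """Return {grid: signals}, where signals[b][t] is the mean of block b at frame t.

    The grid sizes are an illustrative assumption, not values from the paper.
    """
    out = {}
    for grid in grids:
        per_frame = [block_means(frame, grid) for frame in video]
        # Transpose from [frame][block] to [block][frame] to get one signal per block.
        out[grid] = [list(track) for track in zip(*per_frame)]
    return out
```

Called on a list of frames, `multi_scale_signals` returns one signal for the whole ROI at grid size 1, four signals at grid size 2, and sixteen at grid size 4. A coarse scale averages out pixel-level noise, while a fine scale keeps local variation, which is one plausible reason why noise that is hard to separate at a single scale may become separable when several scales are available together.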