IEEE Access (Jan 2024)
Multi-View Human Mesh Reconstruction via Direction-Aware Feature Fusion
Abstract
Although multi-view inputs offer many advantages for human mesh reconstruction (e.g., robustness to occlusion), relatively few studies have explored this setting because of the complexity of the fusion process. In this paper, we investigate how to accurately fuse features encoded from multi-view inputs. The key idea of the proposed method is to combine multi-view image features through a self-attention mechanism with directional encoding. Specifically, the backbone features obtained from each camera viewpoint are encoded into pose and shape features together with a directional vector, defined as the positional difference between each camera and the target subject. The encoded features are fed into a transformer, which generates the fused feature via self-attention. During this process, the directional vector is expanded in dimension and incorporated into the encoder along with the pose and shape features. Given the token embeddings, the query and key matrices are multiplied to compute correlations among all viewpoints. These correlation scores are then used to adaptively fuse features from the different viewpoints. Moreover, the fused features interact with global tokens, which are learnable pose and shape embeddings for the unified mesh model. Through cross-attention, these tokens progressively recalibrate the global pose and shape features of the target subject. Experimental results on multi-view benchmark datasets demonstrate the effectiveness of the proposed method.
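The fusion pipeline described above amounts to two attention stages: self-attention over direction-encoded per-view tokens, followed by cross-attention between learnable global pose/shape tokens and the fused view features. The snippet below is a minimal sketch of that idea, assuming PyTorch; the module names, dimensions, and layer choices are illustrative assumptions and not the authors' released implementation.

```python
# Minimal sketch of direction-aware multi-view fusion (assumed PyTorch layout;
# all names and hyperparameters are hypothetical, not the paper's code).
import torch
import torch.nn as nn


class DirectionAwareFusion(nn.Module):
    def __init__(self, feat_dim=256, dir_dim=3, num_heads=4, num_global_tokens=2):
        super().__init__()
        # Expand the camera-to-subject direction vector to the feature dimension
        # so it can be added to each view's pose/shape token embedding.
        self.dir_proj = nn.Linear(dir_dim, feat_dim)
        # Self-attention over view tokens: query/key products yield the
        # correlation scores between all viewpoints used for adaptive fusion.
        self.view_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # Learnable global tokens for the unified mesh model (e.g., pose and shape).
        self.global_tokens = nn.Parameter(torch.randn(1, num_global_tokens, feat_dim))
        # Cross-attention: global tokens query the fused per-view features.
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, view_feats, view_dirs):
        # view_feats: (B, V, feat_dim) pose/shape features from each camera view
        # view_dirs:  (B, V, dir_dim) direction vectors from each camera to the subject
        tokens = view_feats + self.dir_proj(view_dirs)       # directional encoding
        fused, _ = self.view_attn(tokens, tokens, tokens)    # fuse across viewpoints
        g = self.global_tokens.expand(view_feats.size(0), -1, -1)
        global_feats, _ = self.cross_attn(g, fused, fused)   # recalibrate global tokens
        return global_feats                                  # (B, num_global_tokens, feat_dim)


# Usage example with a batch of 2 subjects observed from 4 camera views:
# fusion = DirectionAwareFusion()
# out = fusion(torch.randn(2, 4, 256), torch.randn(2, 4, 3))
# print(out.shape)  # torch.Size([2, 2, 256])
```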
Keywords