IEEE Access (Jan 2024)
End-to-End 3D Human Pose Estimation Network With Multi-Layer Feature Fusion
Abstract
3D human pose estimation determines the position of the human body in three-dimensional space by recovering body rotations, joint angles, and other pose-related information from image or video data. In this paper, we propose an end-to-end 3D human pose estimation network based on multi-level feature fusion. The network has two main components. The first takes the deepest features extracted by the backbone network, applies an initial data encoding, and passes them to the Semantic Information Extraction Module, which consists primarily of a multi-head self-attention mechanism. This module extracts deeper features, yielding the primary human body feature data. The second component takes the shallowest features and feeds them into the Global Information Processing Module, which performs global feature extraction. The features from both components, together with the bounding box (Bbox) information, are fed into the Iterative Regression Module. This module generates human pose data, which a human pose model then uses to reconstruct the human body. To evaluate our method, we train and test it on well-known benchmark datasets such as 3DPW, AGORA, and MPII. Compared with the best model we referenced, our method reduces the PA-MPJPE error by approximately 5.3% and the MPJPE error by approximately 5.1%.
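For readers unfamiliar with the two error metrics named above, the following is a minimal NumPy sketch of how MPJPE and PA-MPJPE are commonly computed for a single pose. It is an illustration under stated assumptions, not the evaluation code used in this paper: the function names are hypothetical, poses are assumed to be arrays of shape (J, 3), and the Procrustes step uses the standard SVD-based similarity alignment.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints.

    pred, gt: arrays of shape (J, 3) in the same units.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred, gt):
    """Align pred to gt with the best-fit scale, rotation, and translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g          # center both point sets
    h = p.T @ g                            # 3x3 cross-covariance
    u, s, vt = np.linalg.svd(h)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = -1.0 if np.linalg.det(vt.T @ u.T) < 0 else 1.0
    diag = np.array([1.0, 1.0, d])
    rot = vt.T @ np.diag(diag) @ u.T       # optimal rotation
    scale = (s * diag).sum() / (p ** 2).sum()
    return scale * p @ rot.T + mu_g


def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: MPJPE measured after similarity alignment."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

Because PA-MPJPE removes global scale, rotation, and translation before measuring the error, it isolates the quality of the articulated pose, whereas MPJPE also penalizes global misplacement.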
Keywords