Learning Temporal–Spatial Contextual Adaptation for Three-Dimensional Human Pose Estimation

Hexin Wang; Wei Quan; Runjing Zhao; Miaomiao Zhang; Na Jiang

doi:10.3390/s24134422

Sensors (Jul 2024)

Affiliations

Hexin Wang: College of Information Engineering, Capital Normal University, Beijing 100048, China
Wei Quan: College of Information Engineering, Capital Normal University, Beijing 100048, China
Runjing Zhao: College of Information Engineering, Capital Normal University, Beijing 100048, China
Miaomiao Zhang: College of Information Engineering, Capital Normal University, Beijing 100048, China
Na Jiang: College of Information Engineering, Capital Normal University, Beijing 100048, China

DOI: https://doi.org/10.3390/s24134422
Journal volume & issue: Vol. 24, no. 13, p. 4422

Abstract

Three-dimensional human pose estimation aims to generate 3D pose sequences from 2D videos. It has enormous potential in human–robot interaction, remote sensing, virtual reality, and computer vision. Existing high-performing methods primarily explore spatial or temporal encoding to infer 3D poses. However, various architectures exploit the independent effects of spatial and temporal cues on 3D pose estimation while neglecting their synergistic influence. To address this issue, this paper proposes a novel 3D pose estimation method that combines a dual-adaptive spatial–temporal former (DASTFormer) with additional supervised training. The DASTFormer contains attention-adaptive (AtA) and pure-adaptive (PuA) modes, which enhance 2D-to-3D pose inference by adaptively learning spatial–temporal effects, considering both their cooperative and independent influences. In addition, an additional supervised training strategy with a batch variance loss is proposed. Unlike common training strategies, it performs a two-round parameter update on the same batch of data. This not only better explores the potential relationship between spatial–temporal encoding and 3D poses but also alleviates the batch size limitations imposed by graphics cards on transformer-based frameworks. Extensive experimental results show that the proposed method significantly outperforms most state-of-the-art approaches on the Human3.6M and HumanEva datasets.

Published in Sensors

ISSN: 1424-8220 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Chemical technology
Website: http://www.mdpi.com/journal/sensors
