Sensors (Sep 2022)
Cofopose: Conditional 2D Pose Estimation with Transformers
Abstract
Human pose estimation has long been a fundamental problem in computer vision and artificial intelligence. Prominent among 2D human pose estimation (HPE) methods are regression-based approaches, which have achieved excellent results. However, ground-truth labels are often inherently ambiguous in challenging cases such as motion blur, occlusion, and truncation, leading to unreliable performance measurement and reduced accuracy. In this paper, we propose Cofopose, a two-stage approach consisting of person- and keypoint-detection transformers for 2D human pose estimation. Cofopose combines conditional cross-attention, a conditional DEtection TRansformer (conditional DETR), and a transformer encoder-decoder to perform person and keypoint detection. In a significant departure from other approaches, we use conditional cross-attention and a fine-tuned conditional DETR for person detection, and a transformer encoder-decoder for keypoint detection. Cofopose was extensively evaluated on two benchmark datasets, MS COCO and MPII, and achieves significant improvements over existing state-of-the-art frameworks.
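As a minimal illustration of the conditional cross-attention mentioned above, the sketch below follows the general Conditional DETR idea rather than the authors' implementation: each decoder query contributes a content part and a spatial part derived from a learned reference point, and the two parts are concatenated before attending to the encoder memory and its positional encodings. All module and parameter names here are illustrative assumptions.

```python
# Hedged sketch of conditional cross-attention (not the paper's code).
import math
import torch
import torch.nn as nn

def sinusoidal_embed(xy, dim=256, temperature=10000):
    """Map normalized (x, y) reference points to a sinusoidal positional embedding."""
    scale = 2 * math.pi
    half = dim // 2
    dim_t = temperature ** (2 * (torch.arange(half, device=xy.device) // 2) / half)
    pos = xy.unsqueeze(-1) * scale / dim_t                       # (B, N, 2, half)
    pos = torch.stack((pos[..., 0::2].sin(), pos[..., 1::2].cos()), dim=-1).flatten(-2)
    return pos.flatten(-2)                                       # (B, N, dim)

class ConditionalCrossAttention(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.nhead, self.head_dim = nhead, d_model // nhead
        # separate projections for the content and spatial streams
        self.q_content = nn.Linear(d_model, d_model)
        self.q_spatial = nn.Linear(d_model, d_model)
        self.k_content = nn.Linear(d_model, d_model)
        self.k_spatial = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, queries, ref_points, memory, memory_pos):
        # queries:    (B, N, d)  decoder content embeddings
        # ref_points: (B, N, 2)  normalized reference points predicted per query
        # memory:     (B, HW, d) encoder output; memory_pos: its positional encoding
        B, N, d = queries.shape
        qc = self.q_content(queries)
        qs = self.q_spatial(sinusoidal_embed(ref_points, d))
        kc = self.k_content(memory)
        ks = self.k_spatial(memory_pos)
        v = self.v_proj(memory)

        def split(x):  # (B, L, d) -> (B, nhead, L, head_dim)
            return x.view(B, -1, self.nhead, self.head_dim).transpose(1, 2)

        # concatenate content and spatial parts so each contributes its own
        # dot product to the attention logits
        q = torch.cat([split(qc), split(qs)], dim=-1)
        k = torch.cat([split(kc), split(ks)], dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(2 * self.head_dim), dim=-1)
        out = (attn @ split(v)).transpose(1, 2).reshape(B, N, d)
        return self.out_proj(out)

# Usage with hypothetical shapes: 100 queries attending over a 20x20 feature map.
out = ConditionalCrossAttention()(torch.randn(2, 100, 256),
                                  torch.rand(2, 100, 2),
                                  torch.randn(2, 400, 256),
                                  torch.randn(2, 400, 256))
```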
Keywords