Localization and recognition of human action in 3D using transformers

Jiankai Sun; Linjiang Huang; Hongsong Wang; Chuanyang Zheng; Jianing Qiu; Md Tauhidul Islam; Enze Xie; Bolei Zhou; Lei Xing; Arjun Chandrasekaran; Michael J. Black

doi:10.1038/s44172-024-00272-7

Communications Engineering (Sep 2024)

Affiliations

Jiankai Sun: School of Engineering, Stanford University
Linjiang Huang: Department of Information Engineering, The Chinese University of Hong Kong
Hongsong Wang: Department of Computer Science and Engineering, Southeast University
Chuanyang Zheng: Department of Computer Science and Engineering, The Chinese University of Hong Kong
Jianing Qiu: Department of Biomedical Engineering, The Chinese University of Hong Kong
Md Tauhidul Islam: Department of Radiation Oncology, Stanford University
Enze Xie: Department of Computer Science, The University of Hong Kong
Bolei Zhou: Department of Computer Science, University of California, Los Angeles
Lei Xing: School of Engineering, Stanford University
Arjun Chandrasekaran: Perceiving Systems Department, Max Planck Institute for Intelligent Systems
Michael J. Black: Perceiving Systems Department, Max Planck Institute for Intelligent Systems

DOI: https://doi.org/10.1038/s44172-024-00272-7
Journal volume & issue: Vol. 3, no. 1, pp. 1–15

Abstract

Understanding a person’s behavior from their 3D motion sequence is a fundamental problem in computer vision with many applications. An important component of this problem is 3D action localization: recognizing which actions a person performs and when each action occurs in the sequence. To advance research on 3D action localization, we introduce BABEL-TAL (BT), a new, challenging, and more complex benchmark dataset. Baselines, evaluation metrics, and human evaluations are carefully established on this benchmark. We also propose a strong baseline model, Localizing Actions with Transformers (LocATe), that jointly localizes and recognizes actions in a 3D sequence. LocATe shows superior performance on BABEL-TAL as well as on the large-scale PKU-MMD dataset, achieving state-of-the-art performance while using only 10% of the labeled training data. Our research could advance the development of more accurate and efficient systems for human behavior analysis, with potential applications in areas such as human-computer interaction and healthcare.
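
To make the task concrete, the sketch below shows one way the output of an action localizer could be represented and scored. It is not the LocATe implementation or the benchmark's evaluation code: the abstract does not say which metrics are used, so temporal intersection-over-union (a measure commonly used in temporal action localization) is assumed here, and the names `ActionSegment`, `temporal_iou`, `is_correct`, and the 0.5 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionSegment:
    """One localized action: what happened (label) and when (start/end, seconds)."""
    label: str
    start: float
    end: float


def temporal_iou(a: ActionSegment, b: ActionSegment) -> float:
    """Overlap of two time intervals divided by the length of their union."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred: ActionSegment, truth: ActionSegment, threshold: float = 0.5) -> bool:
    """Assumed matching rule: same label and enough temporal overlap."""
    return pred.label == truth.label and temporal_iou(pred, truth) >= threshold


# Illustrative values only, not taken from BABEL-TAL or PKU-MMD.
truth = ActionSegment("walk", 0.0, 4.0)
pred = ActionSegment("walk", 1.0, 5.0)
print(temporal_iou(pred, truth), is_correct(pred, truth))
```

Under this assumed rule, a prediction counts only if it names the right action and overlaps the annotated interval sufficiently, which mirrors the "what" and "when" parts of the task described above.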

Published in Communications Engineering

ISSN: 2731-3395 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Technology: Engineering (General). Civil engineering (General)
Website: https://www.nature.com/commseng/
