A Graph Skeleton Transformer Network for Action Recognition

Yujian Jiang; Zhaoneng Sun; Saisai Yu; Shuang Wang; Yang Song

doi:10.3390/sym14081547

Symmetry (Jul 2022)

Affiliations

Yujian Jiang: State Key Laboratory of Media Convergence of Communication, Communication University of China, Beijing 100024, China
Zhaoneng Sun: State Key Laboratory of Media Convergence of Communication, Communication University of China, Beijing 100024, China
Saisai Yu: State Key Laboratory of Media Convergence of Communication, Communication University of China, Beijing 100024, China
Shuang Wang: State Key Laboratory of Media Convergence of Communication, Communication University of China, Beijing 100024, China
Yang Song: State Key Laboratory of Media Convergence of Communication, Communication University of China, Beijing 100024, China

DOI: https://doi.org/10.3390/sym14081547
Journal volume & issue: Vol. 14, no. 8, p. 1547

Abstract

Skeleton-based action recognition is an active research topic in computer vision. Mainstream methods are currently based on Graph Convolutional Networks (GCNs). Despite their many advantages, GCNs rely mainly on the graph topology to model dependencies between joints, which limits their ability to capture long-distance dependencies. Meanwhile, Transformer-based methods have been applied to skeleton-based action recognition because they capture long-distance dependencies effectively. However, existing Transformer-based methods do not make use of the initial graph structure and therefore lose the inherent connection information of the human skeleton joints. To improve the accuracy of skeleton-based action recognition, this paper proposes a Graph Skeleton Transformer Network (GSTN), which builds on the Transformer architecture to extract global features and uses the undirected skeleton graph, represented by a symmetric matrix, to extract local features. Two encodings are used during feature processing to enhance the joints’ semantic and centrality features. For multi-stream fusion, a grid-search-based method assigns a weight to each input stream to optimize the fused result. We evaluated the method on three action recognition datasets: NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA. The experimental results show that the model’s accuracy is comparable to that of state-of-the-art approaches.
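The abstract does not specify how the grid search over stream weights is set up, so the following is only a minimal sketch of the general idea rather than the authors' implementation. It assumes that each stream outputs per-class scores for the same samples, that the weights are non-negative and sum to 1, and that the objective is accuracy on a held-out set. The names `fuse`, `accuracy`, `grid_search_fusion`, and the `steps` parameter are hypothetical.

```python
import itertools


def fuse(stream_scores, weights):
    """Weighted sum of per-stream class scores for every sample."""
    n_samples = len(stream_scores[0])
    n_classes = len(stream_scores[0][0])
    return [
        [
            sum(w * scores[i][c] for w, scores in zip(weights, stream_scores))
            for c in range(n_classes)
        ]
        for i in range(n_samples)
    ]


def accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def grid_search_fusion(stream_scores, labels, steps=10):
    """Try every weight vector on a regular grid whose entries sum to 1.

    Assumption (not from the paper): weights are multiples of 1/steps,
    and the first weight vector reaching the best accuracy is kept.
    """
    best_weights, best_acc = None, -1.0
    for combo in itertools.product(range(steps + 1), repeat=len(stream_scores)):
        if sum(combo) != steps:
            continue  # keep only weight vectors that sum to 1
        weights = [k / steps for k in combo]
        acc = accuracy(fuse(stream_scores, weights), labels)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc


# Toy data: two streams, three samples, two classes (values are made up).
joint_scores = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]
bone_scores = [[0.2, 0.8], [0.1, 0.9], [0.6, 0.4]]
toy_labels = [0, 1, 0]
weights, acc = grid_search_fusion([joint_scores, bone_scores], toy_labels)
print(weights, acc)
```

The number of grid points grows quickly with the number of streams, so a coarse grid is usually the practical choice. The weights would normally be chosen on validation data rather than on the test set.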

Published in Symmetry

ISSN: 2073-8994 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Mathematics
Website: http://www.mdpi.com/journal/symmetry/
