IEEE Access (Jan 2019)

Multimodal Spatiotemporal Networks for Sign Language Recognition

  • Shujun Zhang,
  • Weijia Meng,
  • Hui Li,
  • Xuehong Cui

DOI
https://doi.org/10.1109/ACCESS.2019.2959206
Journal volume & issue
Vol. 7
pp. 180270–180280

Abstract

Unlike other human behaviors, sign language is characterized by limited local motion of the upper limbs and meticulous hand actions. Some sign language gestures are ambiguous in RGB video because of lighting and background color, which degrades recognition accuracy. We propose a multimodal deep learning architecture for sign language recognition that effectively combines RGB-D input with two-stream spatiotemporal networks. Depth video, as an effective complement to RGB input, supplies additional distance information about the signer's hands. A novel sampling method called ARSS (Aligned Random Sampling in Segments) is put forward to select and align optimal RGB-D video frames, which improves the utilization of multimodal data and reduces redundancy. We obtain the hand ROI from the joint information of the RGB data for local focus in the spatial stream. D-shift Net is proposed for depth motion feature extraction in the temporal stream, fully utilizing the three-dimensional motion information of sign language. The two streams are fused by a convolutional fusion layer to obtain complementary features. Our approach exploits multimodal information and enhances recognition precision, achieving state-of-the-art performance on the CSL (96.7%) and IsoGD (63.78%) datasets.
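The abstract does not detail ARSS, but its name suggests a segment-wise scheme: split each video into equal segments, draw one random frame per segment, and reuse the same indices for both modalities so the RGB and depth clips stay frame-aligned. The following is a minimal Python sketch under those assumptions; arss_sample and the segment count are hypothetical names and values, not the paper's.

import random

def arss_sample(num_frames, num_segments):
    """Hypothetical sketch of ARSS (Aligned Random Sampling in Segments):
    split the video into equal segments, draw one random frame index per
    segment, and reuse the SAME indices for both modalities so the RGB
    and depth clips stay frame-aligned. The segment count and boundary
    handling are assumptions; the abstract gives no further detail."""
    seg_len = num_frames / num_segments
    indices = []
    for s in range(num_segments):
        start = int(s * seg_len)
        end = max(start + 1, int((s + 1) * seg_len))  # at least one frame per segment
        indices.append(random.randrange(start, end))
    return indices

# Stand-in frame lists; in practice these would be decoded video frames.
rgb_frames = list(range(120))
depth_frames = list(range(120))

# One index list drives both modalities, keeping RGB and depth aligned.
indices = arss_sample(num_frames=120, num_segments=8)
rgb_clip = [rgb_frames[i] for i in indices]
depth_clip = [depth_frames[i] for i in indices]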
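Likewise, the convolutional fusion of the two streams can be pictured as stacking the spatial and temporal feature maps along the channel axis and mixing them with a 1x1 convolution, so the network learns complementary cross-stream combinations. This PyTorch sketch rests on that assumption (the abstract only names a "convolutional fusion layer"); ConvFusion and the channel sizes are illustrative, not the paper's.

import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    """Sketch of a convolutional fusion layer: concatenate the spatial
    (RGB) and temporal (depth motion) feature maps along the channel
    axis, then mix them with a learned 1x1 convolution. Channel sizes
    here are illustrative assumptions, not the paper's values."""
    def __init__(self, channels=512):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, spatial, temporal):
        return self.fuse(torch.cat([spatial, temporal], dim=1))

# Toy feature maps: batch of 2, 512 channels, 7x7 spatial grid.
s = torch.randn(2, 512, 7, 7)
t = torch.randn(2, 512, 7, 7)
fused = ConvFusion()(s, t)  # -> shape (2, 512, 7, 7)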

Keywords