Jisuanji kexue yu tansuo (Oct 2024)
Multi-scale Visual Feature Extraction and Cross-Modality Alignment for Continuous Sign Language Recognition
Abstract
Effective visual feature extraction is key to improving continuous sign language recognition performance. However, variation in the temporal length of sign language actions and the weak annotation of sign language videos make effective visual feature extraction difficult. To address these problems, a method named multi-scale visual feature extraction and cross-modality alignment for continuous sign language recognition (MECA) is proposed. The method mainly consists of a multi-scale visual feature extraction module and a cross-modality alignment constraint. Specifically, in the multi-scale visual feature extraction module, bottleneck residual structures with different dilation factors are fused in parallel to enrich the multi-scale temporal receptive field, so that visual features of different temporal lengths can be extracted; a hierarchical reuse design further strengthens these features. In the cross-modality alignment constraint, dynamic time warping models the intrinsic relationship between sign language visual features and textual features, where textual features are extracted by a multilayer perceptron working together with a long short-term memory network. Experiments on the challenging public datasets RWTH-2014, RWTH-2014T and CSL-Daily show that the proposed method achieves competitive performance. The results demonstrate that the multi-scale design in MECA can capture sign language actions of distinct temporal lengths, and that the cross-modality alignment constraint is effective for continuous sign language recognition under weak supervision.
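To make the multi-scale idea concrete, below is a minimal PyTorch sketch of parallel bottleneck residual branches with different dilation factors fused over the temporal axis. All names (BottleneckBranch, MultiScaleTemporalBlock, the dilation set, the reduction ratio) are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class BottleneckBranch(nn.Module):
    """1D bottleneck residual branch: reduce channels, dilated temporal conv, expand."""
    def __init__(self, channels: int, dilation: int, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction
        self.net = nn.Sequential(
            nn.Conv1d(channels, hidden, kernel_size=1),
            nn.BatchNorm1d(hidden),
            nn.ReLU(inplace=True),
            # A dilated conv widens the temporal receptive field without
            # enlarging the kernel; padding keeps sequence length unchanged.
            nn.Conv1d(hidden, hidden, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.BatchNorm1d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, channels, kernel_size=1),
            nn.BatchNorm1d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x + self.net(x))  # residual connection

class MultiScaleTemporalBlock(nn.Module):
    """Parallel dilated branches fused to cover sign actions of different lengths."""
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            BottleneckBranch(channels, d) for d in dilations
        )
        self.fuse = nn.Conv1d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) frame-level visual features
        outs = [branch(x) for branch in self.branches]
        fused = self.fuse(torch.cat(outs, dim=1))
        # One way to realize "hierarchical reuse": keep the block input in
        # the output so later blocks can re-access earlier-scale features.
        return fused + x

feats = torch.randn(2, 512, 64)                   # (batch, channels, frames)
print(MultiScaleTemporalBlock(512)(feats).shape)  # torch.Size([2, 512, 64])
```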
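The textual branch is described as a multilayer perceptron cooperating with a long short-term memory network. A plausible minimal reading, sketched below under the assumption that gloss tokens are first embedded, then transformed by an MLP, then contextualized by an LSTM; the class name GlossTextEncoder and all dimensions are hypothetical.

```python
class GlossTextEncoder(nn.Module):
    """Embeds gloss tokens with an MLP, then models their order with an LSTM."""
    def __init__(self, vocab_size: int, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim)
        )
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, gloss_ids: torch.Tensor) -> torch.Tensor:
        # gloss_ids: (batch, L) token indices of the gloss sentence
        h = self.mlp(self.embed(gloss_ids))
        out, _ = self.lstm(h)  # (batch, L, dim) contextual gloss features
        return out

encoder = GlossTextEncoder(vocab_size=1300)
gloss_ids = torch.randint(0, 1300, (2, 9))  # batch of 2 gloss sentences
print(encoder(gloss_ids).shape)             # torch.Size([2, 9, 512])
```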
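Finally, a sketch of how dynamic time warping can score the alignment between a visual feature sequence and a textual (gloss) feature sequence. The cosine-distance cost and the function name dtw_alignment_cost are assumptions; note also that the classic recursion below is not differentiable at the min, so a relaxation such as soft-DTW would typically be used if this cost served directly as a training loss.

```python
def dtw_alignment_cost(visual: torch.Tensor, textual: torch.Tensor) -> torch.Tensor:
    """visual: (T, D) frame features; textual: (L, D) gloss features; returns DTW cost."""
    # Pairwise cosine distance between every frame and every gloss.
    v = torch.nn.functional.normalize(visual, dim=1)
    t = torch.nn.functional.normalize(textual, dim=1)
    dist = 1.0 - v @ t.T                       # (T, L)

    T, L = dist.shape
    acc = torch.full((T + 1, L + 1), float("inf"))
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, L + 1):
            # Monotonic warping: match, repeat a gloss, or skip ahead.
            acc[i, j] = dist[i - 1, j - 1] + torch.min(
                torch.stack((acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]))
            )
    return acc[T, L]

visual = torch.randn(64, 512)   # 64 frame-level visual features
textual = torch.randn(9, 512)   # 9 gloss-level textual features
print(dtw_alignment_cost(visual, textual))
```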
Keywords