Complex & Intelligent Systems (Nov 2024)
Correlation-guided decoding strategy for low-resource Uyghur scene text recognition
Abstract
Currently, most state-of-the-art scene text recognition methods are based on the Transformer architecture and rely on pre-trained large language models. However, these pre-trained models are primarily designed for resource-rich languages and exhibit limitations when applied to low-resource languages. We propose a Correlation-Guided Decoding Strategy for low-resource Uyghur scene text recognition (CGDS). Specifically, (1) CGDS employs a hybrid encoding strategy that combines a Convolutional Neural Network (CNN) with a Transformer. This hybrid encoding leverages the advantages of both: on one hand, the convolutional inductive bias and weight-sharing mechanism of the CNN allow efficient extraction of local features, reducing dependence on large datasets and minimizing errors caused by visually similar characters; on the other hand, the global attention mechanism of the Transformer captures longer-range dependencies, strengthening the informational links between characters and thereby improving recognition accuracy. A dynamic fusion method then integrates the CNN and Transformer features, adaptively allocating their weights during training and achieving a balance between local and global features. (2) To further strengthen feature extraction, we design a Correlation-Guided Decoding (CGD) module. Unlike existing decoding strategies, we adopt a dual-decoder approach with a Transformer decoder and a CGD decoder. The CGD decoder performs correlation calculations between the outputs of the Transformer decoder and the encoder to optimize the final recognition result. At the same time, the CGD decoder uses the Transformer decoder's outputs to provide semantic guidance for the encoder's feature extraction, enabling the model to better understand the semantic structure of the input. This dual-decoder strategy guides the model toward extracting effective features, enhancing its ability to learn internal language knowledge and to exploit the useful information in the input data more fully. (3) We construct two Uyghur scene text datasets, U1 and U2. Experimental results show that our method outperforms existing techniques on low-resource Uyghur scene text recognition: CGDS improves accuracy by 50.2% on U1 and 13.6% on U2, for an overall accuracy improvement of 15.9%.
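To make the dynamic fusion step concrete, the sketch below shows one common way such adaptive weighting is realized: a learned gate predicts per-position mixing coefficients between the CNN (local) and Transformer (global) feature streams. The abstract does not specify the fusion mechanism, so the module name, gate design, and tensor shapes here are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DynamicFusion(nn.Module):
    """Illustrative gate that adaptively weights CNN and Transformer features."""

    def __init__(self, dim: int):
        super().__init__()
        # The gate maps the concatenated streams to a coefficient in [0, 1]
        # for each position and channel (an assumed design, for illustration).
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, f_cnn: torch.Tensor, f_tr: torch.Tensor) -> torch.Tensor:
        # f_cnn, f_tr: (batch, seq_len, dim) feature sequences.
        g = self.gate(torch.cat([f_cnn, f_tr], dim=-1))
        # Convex combination: g weights the local (CNN) features,
        # (1 - g) the global (Transformer) features.
        return g * f_cnn + (1 - g) * f_tr


# Usage: fuse = DynamicFusion(dim=256); fused = fuse(cnn_feats, tr_feats)
```

Learning the gate jointly with the rest of the network lets the balance between local and global evidence shift over training, which matches the abstract's description of adaptively allocated feature weights.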
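The correlation computation in the CGD decoder is likewise unspecified in the abstract; one plausible reading is a cross-attention step in which the Transformer decoder's outputs query the encoder features, and the attended context refines the final prediction. The following is a minimal sketch under that assumption; the class name, head count, and residual refinement are all hypothetical.

```python
import torch
import torch.nn as nn


class CGDDecoderSketch(nn.Module):
    """Assumed correlation step: decoder states attend over encoder features."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, dec_out: torch.Tensor, enc_out: torch.Tensor) -> torch.Tensor:
        # dec_out: (batch, num_chars, dim) -- Transformer decoder outputs
        # enc_out: (batch, seq_len, dim)   -- encoder feature sequence
        # Correlation as scaled dot-product attention of decoder states
        # over encoder features; the context refines character logits.
        context, _ = self.attn(query=dec_out, key=enc_out, value=enc_out)
        return self.classifier(dec_out + context)
```

Under this reading, the same attention pathway also explains how decoder-side semantics can guide encoder feature extraction: gradients flowing through the correlation term push the encoder toward features that align with the decoder's linguistic predictions.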
Keywords