A deep action-oriented video image classification system for text detection and recognition

Abhra Chaudhuri; Palaiahnakote Shivakumara; Pinaki Nath Chowdhury; Umapada Pal; Tong Lu; Daniel Lopresti; G. Hemantha Kumar

doi:10.1007/s42452-021-04821-z

SN Applied Sciences (Oct 2021)

Affiliations

Abhra Chaudhuri: Computer Vision and Pattern Recognition Unit, Indian Statistical Institute
Palaiahnakote Shivakumara: Department of Computer System and Technology, Universiti Malaya
Pinaki Nath Chowdhury: Computer Vision and Pattern Recognition Unit, Indian Statistical Institute
Umapada Pal: Computer Vision and Pattern Recognition Unit, Indian Statistical Institute
Tong Lu: National Key Lab for Novel Software Technology, Nanjing University
Daniel Lopresti: Computer Science & Engineering, Lehigh University
G. Hemantha Kumar: Department of Studies in Computer Science, University of Mysore

DOI: https://doi.org/10.1007/s42452-021-04821-z
Journal volume & issue: Vol. 3, no. 11
pp. 1 – 24

Abstract

Achieving accurate text detection and recognition in video images with complex actions is very challenging. This paper presents a hybrid model for classifying action-oriented video images, which reduces the complexity of the problem and thereby improves text detection and recognition performance. We consider five genre categories: concert, cooking, craft, teleshopping and yoga. To classify action-oriented video images, we use ResNet50 to learn general pixel-distribution-level information, one VGG16 network to learn features of Maximally Stable Extremal Regions, and a second VGG16 network to learn facial components obtained by a multitask cascaded convolutional network. The approach integrates the outputs of these three models using a fully connected neural network to classify images into the five action-oriented classes. We demonstrate the efficacy of the proposed method by testing on our own dataset and on two standard datasets: the Scene Text Dataset, which contains 10 classes of scene images with text information, and the Stanford 40 Actions dataset, which contains 40 action classes without text information. Our method outperforms related existing work and significantly enhances class-specific text detection and recognition performance.

Article highlights

1. The method uses pixel, stable-region and face-component information in a novel way to solve complex classification problems.
2. The proposed work fuses different deep learning models for successful classification of action-oriented images.
3. Experiments on our own dataset as well as standard datasets show that the proposed model outperforms related state-of-the-art (SOTA) methods.
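
The fusion step described in the abstract, where the outputs of three branches are combined by a fully connected network into five classes, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each branch (ResNet50 on pixels, VGG16 on stable regions, VGG16 on face components) has already produced a feature vector, and it uses hypothetical feature sizes, a hypothetical hidden width, and untrained random weights. The names `FusionHead`, `init_layer` and `predict_proba` are invented for illustration.

```python
import math
import random

# The five genre classes named in the abstract.
CLASSES = ["concert", "cooking", "craft", "teleshopping", "yoga"]


def softmax(logits):
    """Turn raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, bias):
    """One fully connected layer: weights @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_out, n_in, rng):
    """Random, untrained weights; a real system would learn these."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class FusionHead:
    """Hypothetical late-fusion head: concatenate branch features,
    then apply two fully connected layers and a softmax."""

    def __init__(self, branch_dims, hidden=16, n_classes=len(CLASSES), seed=0):
        rng = random.Random(seed)
        self.in_dim = sum(branch_dims)
        self.w1, self.b1 = init_layer(hidden, self.in_dim, rng)
        self.w2, self.b2 = init_layer(n_classes, hidden, rng)

    def predict_proba(self, pixel_feat, mser_feat, face_feat):
        # Concatenation is an assumption; the abstract only says the
        # outputs of the three models are integrated.
        fused = list(pixel_feat) + list(mser_feat) + list(face_feat)
        if len(fused) != self.in_dim:
            raise ValueError("feature sizes do not match branch_dims")
        hidden = relu(dense(fused, self.w1, self.b1))
        return softmax(dense(hidden, self.w2, self.b2))


if __name__ == "__main__":
    rng = random.Random(1)
    # Stand-ins for the three branch outputs (sizes are assumptions).
    pixel = [rng.random() for _ in range(8)]
    mser = [rng.random() for _ in range(6)]
    face = [rng.random() for _ in range(4)]

    head = FusionHead(branch_dims=(8, 6, 4))
    probs = head.predict_proba(pixel, mser, face)
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    print(CLASSES[best], [round(p, 3) for p in probs])
```

Because the weights are random, the predicted class carries no meaning here. The sketch only shows how three independently learned representations can feed a single classifier, so that each branch contributes its own evidence (pixel distribution, text-like stable regions, faces) to the final decision.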

Published in SN Applied Sciences

ISSN: 2523-3963 (Print); 2523-3971 (Online)
Publisher: Springer
Country of publisher: Switzerland
LCC subjects: Science; Technology
Website: https://www.springer.com/snas
