Jisuanji kexue yu tansuo (Jan 2024)
Knowledge Graph-Based Video Classification Algorithm for Film and Television Drama
Abstract
Because video is perceived through multiple modalities, a complete hierarchical video tag classification algorithm combines the visual and textual modalities and trains a joint model to infer video content. However, most existing studies are only applicable to coarse-grained classification, whereas classifying film and television drama requires finer-grained identification. This study proposes a knowledge graph-based video classification algorithm. First, the algorithm extracts visual and textual features using a multimodal pre-trained model trained on large-scale generic data. A multi-task video label prediction model is then trained to produce three levels of labels for each video: content labels, theme labels and entity labels. Introducing a similarity task into the multi-task network raises the training difficulty of the classification model: the similarity task fits similar samples more tightly, while the learned features better express the differences between samples. Second, for entity labels, an entity correction model with a local attention head is proposed. By introducing co-occurrence information from the knowledge graph, it can fuse, de-duplicate or extend the prediction results and produce more accurate entity label predictions. Based on semi-structured data retrieved from Douban, this paper constructs a film and television knowledge graph and conducts an empirical study of the video tag classification model for film and television. The experimental results show two things. First, the cross-entropy loss and the similarity-task loss jointly constrain the training of the classification model, which optimizes the feature representation: Top-1 accuracy improves by 3.70%, 3.35% and 16.57% for content labels, theme labels and entity labels, respectively. Second, after knowledge graph information is introduced, the entity correction model with global/local attention heads improves the Top-1 accuracy of entity labels from 38.7% to 45.6%. The proposed approach is a new attempt at multimodal video classification using image-text pair data, and offers a new research idea for short video classification when only a small number of data samples are available.
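The joint constraint of the cross-entropy loss and the similarity-task loss described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the similarity loss or its weighting, so the cosine-based contrastive form, the `margin` value, the `sim_weight` parameter and all function names below are assumptions made for illustration only, not the model used in this study.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy of one sample, given raw logits and the true class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_loss(emb_a, emb_b, same_label, margin=0.5):
    """Assumed contrastive form: pull same-label pairs together and
    penalize different-label pairs whose similarity exceeds the margin."""
    sim = cosine(emb_a, emb_b)
    if same_label:
        return 1.0 - sim
    return max(0.0, sim - margin)


def joint_loss(head_logits, head_targets, emb_a, emb_b, same_label, sim_weight=1.0):
    """Sum of per-head cross-entropy (content, theme, entity) plus the
    weighted similarity term. sim_weight is a hypothetical hyperparameter."""
    ce = sum(
        softmax_cross_entropy(head_logits[head], head_targets[head])
        for head in head_logits
    )
    return ce + sim_weight * similarity_loss(emb_a, emb_b, same_label)


# Toy usage: three label heads and one pair of fused video embeddings.
logits = {"content": [2.0, 0.1], "theme": [0.3, 1.2, 0.0], "entity": [0.5, 0.5]}
targets = {"content": 0, "theme": 1, "entity": 1}
print(joint_loss(logits, targets, [0.9, 0.1], [0.8, 0.3], same_label=True))
```

In this sketch the similarity term only shapes the shared embedding, while each label level keeps its own cross-entropy term, which is one plausible way for the two losses to constrain the same feature representation.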
Keywords