Self-Supervised Learning to Detect Key Frames in Videos

Xiang Yan; Syed Zulqarnain Gilani; Mingtao Feng; Liang Zhang; Hanlin Qin; Ajmal Mian

doi:10.3390/s20236941

Sensors (Dec 2020)

Affiliations

Xiang Yan: School of Physics and Optoelectronic Engineering, Xidian University, Xi’an 710071, China
Syed Zulqarnain Gilani: School of Science, Edith Cowan University, Joondalup 6027, Australia
Mingtao Feng: School of Computer Science and Technology, Xidian University, Xi’an 710071, China
Liang Zhang: School of Computer Science and Technology, Xidian University, Xi’an 710071, China
Hanlin Qin: School of Physics and Optoelectronic Engineering, Xidian University, Xi’an 710071, China
Ajmal Mian: Computer Science and Software Engineering, University of Western Australia, Crawley 6009, Australia

DOI: https://doi.org/10.3390/s20236941
Journal volume & issue: Vol. 20, no. 23, p. 6941

Abstract

Detecting key frames in videos is a common problem in many applications such as video classification, action recognition and video summarization. These tasks can be performed more efficiently using only a handful of key frames rather than the full video. Existing key frame detection approaches are mostly designed for supervised learning and require manual labelling of key frames in a large corpus of training data. Labelling requires human annotators from different backgrounds to annotate key frames in videos, which is not only expensive and time-consuming but also prone to subjective errors and inconsistencies between labellers. To overcome these problems, we propose an automatic self-supervised method for detecting key frames in a video. Our method comprises a two-stream ConvNet and a novel automatic annotation architecture that reliably annotates key frames in a video for self-supervised learning of the ConvNet. The proposed ConvNet learns deep appearance and motion features to detect frames that are unique. The trained network is then able to detect key frames in test videos. Extensive experiments on the UCF101 human action dataset and the VSUMM video summarization dataset demonstrate the effectiveness of our proposed method.
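
To make the idea of scoring frames by appearance and motion concrete, the following is a minimal toy sketch in Python. It is not the two-stream ConvNet or the annotation architecture described in the abstract: it swaps in a hand-written heuristic (distance from the mean frame as an appearance cue, difference from the previous frame as a motion cue) purely for illustration. The frame representation (flattened grayscale lists) and the function names `mean_abs_diff`, `uniqueness_scores` and `top_k_key_frames` are all assumptions introduced here.

```python
from typing import List, Sequence

# Hypothetical representation: each frame is a flattened list of grayscale values.
Frame = Sequence[float]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute per-pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def uniqueness_scores(frames: List[Frame]) -> List[float]:
    """Score each frame with a toy heuristic (an assumption, not the paper's model).

    appearance: how far the frame is from the mean frame of the clip
    motion:     how much the frame differs from the previous frame
    """
    n = len(frames)
    # Per-pixel average over all frames in the clip.
    mean_frame = [sum(pixels) / n for pixels in zip(*frames)]
    scores = []
    for i, frame in enumerate(frames):
        appearance = mean_abs_diff(frame, mean_frame)
        # The first frame has no predecessor, so its motion cue is zero.
        motion = mean_abs_diff(frame, frames[i - 1]) if i > 0 else 0.0
        scores.append(appearance + motion)
    return scores


def top_k_key_frames(frames: List[Frame], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, in temporal order."""
    scores = uniqueness_scores(frames)
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Four tiny two-pixel frames; only the third one differs from the rest.
    clip = [[0, 0], [0, 0], [10, 10], [0, 0]]
    print(uniqueness_scores(clip))
    print(top_k_key_frames(clip, 1))
```

In a learned approach such as the one the abstract describes, the hand-written cues above would presumably be replaced by features learned by the network; the sketch only shows how per-frame scores can be turned into a small set of key frame indices.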

Published in Sensors

ISSN: 1424-8220 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Technology: Chemical technology
Website: http://www.mdpi.com/journal/sensors
