Applied Sciences (Aug 2021)

High Performance DeepFake Video Detection on CNN-Based with Attention Target-Specific Regions and Manual Distillation Extraction

  • Van-Nhan Tran,
  • Suk-Hwan Lee,
  • Hoanh-Su Le,
  • Ki-Ryong Kwon

DOI
https://doi.org/10.3390/app11167678
Journal volume & issue
Vol. 11, no. 16
p. 7678

Abstract

Read online

The rapid development of deep learning models that can produce and synthesize hyper-realistic videos are known as DeepFakes. Moreover, the growth of forgery data has prompted concerns about malevolent intent usage. Detecting forgery videos are a crucial subject in the field of digital media. Nowadays, most models are based on deep learning neural networks and vision transformer, SOTA model with EfficientNetB7 backbone. However, due to the usage of excessively large backbones, these models have the intrinsic drawback of being too heavy. In our research, a high performance DeepFake detection model for manipulated video is proposed, ensuring accuracy of the model while keeping an appropriate weight. We inherited content from previous research projects related to distillation methodology but our proposal approached in a different way with manual distillation extraction, target-specific regions extraction, data augmentation, frame and multi-region ensemble, along with suggesting a CNN-based model as well as flexible classification with a dynamic threshold. Our proposal can reduce the overfitting problem, a common and particularly important problem affecting the quality of many models. So as to analyze the quality of our model, we performed tests on two datasets. DeepFake Detection Dataset (DFDC) with our model obtains 0.958 of AUC and 0.9243 of F1-score, compared with the SOTA model which obtains 0.972 of AUC and 0.906 of F1-score, and the smaller dataset Celeb-DF v2 with 0.978 of AUC and 0.9628 of F1-score.

Keywords