Безопасность информационных технологий (Nov 2024)

An ensemble of modern computer vision models for deepfake detection

  • Aleksandr S. Pikul

DOI
https://doi.org/10.26583/bit.2024.4.08
Journal volume & issue
Vol. 31, no. 4
pp. 116 – 127

Abstract

This article explores the use of modern computer vision architectures for deepfake detection. Five architectures are considered: EfficientNet, Vision Transformer (ViT), VisionLSTM (ViL), Vision KAN, and Mamba Vision. The novelty of the approach lies in applying and comparing these architectures and in combining them into paired ensembles to improve detection accuracy. In the experiment, each architecture was evaluated both individually and as part of a two-model ensemble. The dataset was built from frames of videos containing deepfakes, and the frames were subjected to various augmentations. The results show that ensembles of modern architectures can improve deepfake detection accuracy: the ensemble of ViT and VisionLSTM achieved an F1-score of 97.68%, higher than either architecture achieved on its own. Not all ensembles improved the metrics, however. For example, the combination of Mamba Vision and VisionLSTM reached an F1-score of 95.78%, lower than that of Mamba Vision alone. The findings are relevant to professionals working in computer vision, cybersecurity, and multimedia content analysis. The proposed architectures and their ensembles can be applied to detecting deepfakes and other forms of fake content, which is important for protection against information threats.
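
The abstract does not state how the two models in an ensemble are combined. As a minimal sketch, assuming the simplest option (averaging the per-frame "fake" probabilities of the two models, often called soft voting), the comparison of individual models against paired ensembles could look like the following. The model names, probabilities, and labels are invented toy values for illustration only and are not the article's data or results.

```python
from itertools import combinations


def f1_score(y_true, y_pred):
    """F1-score for the positive class (1 = deepfake, 0 = real)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def to_labels(probs, threshold=0.5):
    """Turn per-frame 'fake' probabilities into hard 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probs]


def average_pair(probs_a, probs_b):
    """Assumed combination rule: mean of the two models' probabilities."""
    return [(a + b) / 2 for a, b in zip(probs_a, probs_b)]


def evaluate_pairs(model_probs, y_true):
    """Score every model alone and every two-model ensemble."""
    scores = {
        name: f1_score(y_true, to_labels(probs))
        for name, probs in model_probs.items()
    }
    for a, b in combinations(model_probs, 2):
        combined = average_pair(model_probs[a], model_probs[b])
        scores[f"{a}+{b}"] = f1_score(y_true, to_labels(combined))
    return scores


# Hypothetical toy inputs: six frames, two placeholder models.
y_true = [1, 1, 1, 0, 0, 0]
model_probs = {
    "model_a": [0.9, 0.4, 0.8, 0.2, 0.6, 0.1],
    "model_b": [0.7, 0.9, 0.3, 0.1, 0.3, 0.4],
}

scores = evaluate_pairs(model_probs, y_true)
for name, score in scores.items():
    print(f"{name}: F1 = {score:.4f}")
```

In this toy case the two models err on different frames, so their average scores higher than either one alone. When the errors of two models coincide, or one model is clearly weaker, averaging can instead lower the score, which is consistent with the abstract's observation that not every pairing helps.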

Keywords