Electronic Research Archive (Dec 2024)

ViT-DualAtt: An efficient pornographic image classification method based on Vision Transformer with dual attention

  • Zengyu Cai,
  • Liusen Xu,
  • Jianwei Zhang,
  • Yuan Feng,
  • Liang Zhu,
  • Fangmei Liu

DOI
https://doi.org/10.3934/era.2024313
Journal volume & issue
Vol. 32, no. 12
pp. 6698 – 6716

Abstract

Pornographic images not only pollute the internet environment but can also harm societal values and the mental health of young people. Accurately classifying and filtering such images is therefore crucial to maintaining the safety of the online community. In this paper, we propose a novel pornographic image classification model named ViT-DualAtt. The model adopts a hierarchical CNN-Transformer structure that combines the strengths of Convolutional Neural Networks (CNNs) and Transformers to capture and integrate both local and global features, thereby enhancing the accuracy and diversity of the feature representation. In addition, the model integrates multi-head attention and convolutional block attention mechanisms to further improve classification accuracy. Experiments were conducted on the nsfw_data_scrapper dataset, made publicly available on GitHub by data scientist Alexander Kim. The results show that ViT-DualAtt achieves a classification accuracy of 97.2% ± 0.1% on the pornographic image classification task, outperforming the current state-of-the-art model (RepVGG-SimAM) by 2.7%. Furthermore, the model achieves a pornographic image miss rate of only 1.6%, significantly reducing the risk of pornographic image dissemination on internet platforms.
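
To make the two reported metrics concrete, the following minimal Python sketch computes classification accuracy and miss rate from predicted labels. It is an illustration only, not the authors' evaluation code: the function name `accuracy_and_miss_rate` is hypothetical, the task is simplified to binary labels, and the miss rate is assumed to mean FN / (FN + TP), the share of truly pornographic images classified as something else. The abstract does not state the exact definition used in the paper.

```python
def accuracy_and_miss_rate(y_true, y_pred, positive=1):
    """Return (accuracy, miss_rate) for two equal-length label sequences.

    Assumption (not taken from the paper): miss rate = FN / (FN + TP),
    where `positive` marks the pornographic class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    # Accuracy: fraction of samples whose predicted label matches the truth.
    correct = sum(1 for t, p in pairs if t == p)
    # True positives: pornographic images that were flagged.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False negatives: pornographic images that were missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    positives = tp + fn
    miss_rate = fn / positives if positives else 0.0
    return correct / len(pairs), miss_rate


# Toy labels for illustration only (1 = pornographic, 0 = other).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
acc, miss = accuracy_and_miss_rate(y_true, y_pred)
print(f"accuracy={acc:.2f}, miss_rate={miss:.2f}")
```

On these toy labels, six of eight predictions are correct and one of the four positive images is missed, so the sketch reports an accuracy of 0.75 and a miss rate of 0.25. The two metrics are reported separately because, in a filtering setting, a false negative lets harmful content through while a false positive only blocks a harmless image.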

Keywords