IEEE Access (Jan 2024)
Adversarial Robustness of Vision Transformers Versus Convolutional Neural Networks
Abstract
Vision Transformers (ViTs) have emerged as a powerful alternative to Convolutional Neural Networks (CNNs) across a range of computer vision tasks, achieving remarkable results through self-attention. However, the adversarial robustness of ViTs raises critical questions, and their suitability for security-sensitive applications remains an open issue. This paper presents a systematic evaluation and comparison of the adversarial robustness of ViTs and CNNs, focusing specifically on image classification. We perform extensive experiments with state-of-the-art adversarial attacks, including the Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), and the DeepFool Attack (DFA). The findings indicate that CNNs are more robust to simpler attacks such as FGSM, whereas ViTs show stronger resistance to more powerful attacks such as PGD and DFA. These results highlight the respective strengths and limitations of CNNs and ViTs and provide guidance for future research and for the safer, more effective deployment of both model families.
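The comparison described above hinges on measuring classification accuracy under a fixed perturbation budget for each attack. The following is a minimal sketch, not the authors' code, of how such a robustness measurement could be carried out in PyTorch for the FGSM case; the names `resnet50`, `vit_b16`, `test_loader`, and the value of `eps` are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps):
    """One-step FGSM: x_adv = clip(x + eps * sign(grad_x L(f(x), y)), 0, 1)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def robust_accuracy(model, loader, eps, device="cuda"):
    """Fraction of FGSM adversarial examples the model still classifies correctly."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = fgsm_attack(model, x, y, eps)          # craft perturbed inputs
        with torch.no_grad():
            pred = model(x_adv).argmax(dim=1)          # predict on adversarial batch
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total

# Hypothetical usage: evaluate a CNN and a ViT on the same test set and budget,
# e.g. a torchvision ResNet-50 versus a ViT-B/16, then compare the two scores.
# acc_cnn = robust_accuracy(resnet50, test_loader, eps=4 / 255)
# acc_vit = robust_accuracy(vit_b16, test_loader, eps=4 / 255)
```

Iterative attacks such as PGD or DeepFool would replace `fgsm_attack` with a multi-step perturbation routine, but the accuracy-under-attack measurement itself stays the same.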
Keywords