Efficient Method for Robust Backdoor Detection and Removal in Feature Space Using Clean Data

Donik Vrsnak; Marko Subasic; Sven Loncaric

doi:10.1109/ACCESS.2025.3531716

IEEE Access (Jan 2025)

Affiliations

Donik Vrsnak: Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia
Marko Subasic: Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia
Sven Loncaric: Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia

DOI: https://doi.org/10.1109/ACCESS.2025.3531716
Journal volume & issue: Vol. 13
pp. 18215 – 18227

Abstract

The steady increase in proposed backdoor attacks on deep neural networks highlights the need for robust defense methods for their detection and removal. A backdoor attack is a type of attack in which hidden triggers are added to the input data during training, with the goal of changing the behavior of the model during inference. These attacks pose a significant security threat in critical applications such as street sign or pedestrian recognition for autonomous vehicles, biometric authentication, image retrieval, and semantic labeling. To combat these threats, many defense mechanisms have been proposed. These methods target different areas, such as computer vision (CV) and natural language processing (NLP), and thus rely on different assumptions about the nature of the input data and the type of backdoor trigger used in the attack. However, an attacker can exploit these assumptions, which reduces the effectiveness of such defenses in real-world scenarios. Thus, a robust method for backdoor detection needs to have broad and simple assumptions. Furthermore, detection methods that rely on the input data are constrained to the modality of that input and cannot be applied to other modalities. In this work, FEAT-IN, a novel method for backdoor detection and removal in classification tasks, is proposed; it operates on features extracted by the attacked model. The method can detect and reconstruct the feature representation of the possible triggers used in attacking the neural network. Using these reconstructed trigger features, it can efficiently mitigate the effects of an attack. Extensive experiments on multiple datasets and attack methods demonstrate that, compared to state-of-the-art methods such as Neural Cleanse, Neural Attention Distillation, I-BAU, and BTI-DBF, FEAT-IN provides several benefits.
First, it detects and mitigates backdoor attacks more consistently than similar trigger inversion defense methods that operate in the input space instead of the feature space; on average, it achieves an approximately 10% greater decrease in attack success rate during mitigation than the second-best method. Second, it reduces the memory footprint and the computation time by at least an order of magnitude compared to other methods, which makes FEAT-IN practical for real-world scenarios. Finally, it is not constrained to computer vision tasks, as its underlying assumption also holds for the feature spaces of other problems, which is demonstrated by applying it without any change to semantic analysis on the SST-2 dataset.
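
The abstract reports results in terms of attack success rate but does not define it. As a rough illustration only, the sketch below assumes the definition commonly used in the backdoor literature: the fraction of triggered inputs, whose true class is not the attacker's target, that the model assigns to the target class. The function name, the label lists, and the numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(
    true_labels: Sequence[int],
    triggered_preds: Sequence[int],
    target_label: int,
) -> float:
    """Share of triggered samples from non-target classes predicted as the target.

    Assumed definition (not stated in the abstract): samples whose true class
    already equals the target are excluded, since they cannot show a flip.
    """
    eligible = [
        pred
        for label, pred in zip(true_labels, triggered_preds)
        if label != target_label
    ]
    if not eligible:
        raise ValueError("no samples outside the target class")
    hits = sum(1 for pred in eligible if pred == target_label)
    return hits / len(eligible)


# Hypothetical toy data: predictions on triggered inputs before and after a defense.
true_labels = [0, 1, 2, 3, 1, 2]
preds_before = [0, 0, 0, 0, 0, 2]  # most triggered inputs flip to target class 0
preds_after = [0, 1, 2, 0, 1, 2]   # after mitigation, only one still flips

asr_before = attack_success_rate(true_labels, preds_before, target_label=0)
asr_after = attack_success_rate(true_labels, preds_after, target_label=0)
asr_decrease = asr_before - asr_after
```

Under this assumed definition, a defense is judged by how far it lowers the rate on triggered inputs, which is the quantity the comparison with the second-best method refers to.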

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
