IEEE Access (Jan 2023)

A Novel Semi-Supervised Adversarially Learned Meta-Classifier for Detecting Neural Trojan Attacks

  • Shahram Ghahremani,
  • Amir Jalaly Bidgoly,
  • Uyen Trang Nguyen,
  • David K. Y. Yau

DOI
https://doi.org/10.1109/ACCESS.2023.3339542
Journal volume & issue
Vol. 11
pp. 138303 – 138315

Abstract


Deep neural networks (DNNs) are highly vulnerable to neural Trojan attacks. To carry out such an attack, an adversary retrains a DNN with poisoned data or modifies its parameters so that it produces incorrect output. These attacks can remain unnoticed until triggered by a specific pattern in the input, making detection challenging. In this article, we propose a novel semi-supervised adversarially learned meta-classifier (SESALME) to detect whether a target model has been trojaned. Unlike previous Trojan detection methods, SESALME assumes that the defender has no knowledge of the attack mechanisms and no access to training data, poisoned data, or the parameters/layers of the target model. In the absence of poisoned data and knowledge of the attack mechanisms, we use a set of shadow models to emulate the normal behavior of the target model. Having learned this normal behavior, SESALME then uses one-class learning, implemented within a semi-supervised generative adversarial network (GAN), to detect any abnormal behavior of a model under investigation. Behavior that deviates from the learned normal behavior indicates a high likelihood that the model is trojaned. We compare the performance of SESALME with that of state-of-the-art neural Trojan detectors on popular datasets such as MNIST, CIFAR-10, and SC. Experimental results show that SESALME outperforms state-of-the-art Trojan detection methods in terms of detection performance and inference time in almost all cases, while being attack-agnostic and requiring no access to training data, poisoned data, or the parameters of the target model.
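To make the detection pipeline described in the abstract concrete, the sketch below illustrates one plausible reading of it: clean shadow models are queried with a fixed set of probe inputs to build "normal behavior" feature vectors, a GAN discriminator is trained as a one-class anomaly scorer over those vectors, and a target model whose behavior vector scores as anomalous is flagged as trojaned. This is a minimal illustrative sketch only, not the authors' SESALME implementation; the PyTorch framework, the use of flattened softmax outputs as behavior vectors, the network sizes, and functions such as `behavior_vector` and `train_one_class_gan` are all assumptions introduced here for illustration.

import torch
import torch.nn as nn


def behavior_vector(model, probe_inputs):
    """Query a (black-box) model and flatten its softmax outputs into one behavior vector."""
    with torch.no_grad():
        logits = model(probe_inputs)                   # (n_probes, n_classes)
        return torch.softmax(logits, dim=1).flatten()  # (n_probes * n_classes,)


class Generator(nn.Module):
    """Maps random noise to synthetic behavior vectors."""
    def __init__(self, z_dim, feat_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim), nn.Sigmoid())

    def forward(self, z):
        return self.net(z)


class Discriminator(nn.Module):
    """Scores behavior vectors; high score = consistent with normal (shadow) behavior."""
    def __init__(self, feat_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1))  # raw logit

    def forward(self, x):
        return self.net(x)


def train_one_class_gan(shadow_vectors, z_dim=64, epochs=200, lr=2e-4):
    """Train G/D so that D assigns high scores only to normal shadow-model behavior."""
    feat_dim = shadow_vectors.size(1)
    G, D = Generator(z_dim, feat_dim), Discriminator(feat_dim)
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    real_lbl = torch.ones(shadow_vectors.size(0), 1)
    fake_lbl = torch.zeros(shadow_vectors.size(0), 1)
    for _ in range(epochs):
        # Discriminator step: real shadow vectors vs. generated vectors.
        fake = G(torch.randn(shadow_vectors.size(0), z_dim)).detach()
        loss_d = bce(D(shadow_vectors), real_lbl) + bce(D(fake), fake_lbl)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        # Generator step: push generated vectors toward the "normal" region.
        loss_g = bce(D(G(torch.randn(shadow_vectors.size(0), z_dim))), real_lbl)
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return D


def is_trojaned(D, target_vector, threshold=0.5):
    """Flag the target model if its behavior deviates from the learned normal behavior."""
    score = torch.sigmoid(D(target_vector.unsqueeze(0))).item()
    return score < threshold

In use, one would stack `behavior_vector` outputs from many clean shadow models into `shadow_vectors`, train the discriminator once, and then score any suspect model's behavior vector; because only black-box queries are needed, this matches the abstract's setting of no access to the target model's training data, poisoned data, or parameters.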

Keywords