Cybersecurity (Aug 2024)

EvilPromptFuzzer: generating inappropriate content based on text-to-image models

  • Juntao He,
  • Haoran Dai,
  • Runqi Sui,
  • Xuejing Yuan,
  • Dun Liu,
  • Hao Feng,
  • Xinyue Liu,
  • Wenchuan Yang,
  • Baojiang Cui,
  • Kedan Li

DOI
https://doi.org/10.1186/s42400-024-00279-9
Journal volume & issue
Vol. 7, no. 1
pp. 1 – 20

Abstract

Text-to-image (TTI) models offer enormous innovative potential to many industries, but the content-security risks they introduce have also attracted wide attention. Considerable research has focused on content-security threats of large language models (LLMs), yet comprehensive studies on the content security of TTI models are notably scarce. This paper introduces a systematic tool, named EvilPromptFuzzer, designed to fuzz evil prompts against TTI models. For 15 fine-grained risk categories, EvilPromptFuzzer employs the strong knowledge-mining ability of LLMs to construct seed banks whose seeds cover various types of characters, interrelations, actions, objects, expressions, body parts, locations, surroundings, etc. These seeds are then fed back into the LLMs to build scene-diverse prompts, which weaken the semantic sensitivity associated with the fine-grained risks. As a result, the prompts can bypass the content audit mechanism of the TTI model and ultimately yield images with inappropriate content. For the violence, horrible, disgusting, animal cruelty, religious bias, political symbol, and extremism categories, the efficiency of EvilPromptFuzzer in generating inappropriate images with DALL·E 3 exceeds 30%; that is, more than 30 of the images generated from 100 prompts are malicious. In particular, the efficiencies for the horrible, disgusting, political symbol, and extremism categories reach 58%, 64%, 71%, and 50%, respectively. Additionally, we analyze the vulnerability of popular content audit platforms, including Amazon, Google, Azure, and Baidu. Even Google SafeSearch, the most effective cloud platform, identifies only 33.85% of malicious images across three distinct categories.
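
To make the audit-platform evaluation mentioned above concrete, the minimal sketch below (an assumption for illustration, not the authors' released code) screens a single generated image with Google Cloud Vision's SafeSearch detection; the file path and the way results are read off are illustrative only.

from google.cloud import vision

# SafeSearch returns a likelihood per category; index the enum value into names.
LIKELIHOOD_NAMES = (
    "UNKNOWN", "VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY",
)

def safesearch_flags(image_path):
    """Return SafeSearch likelihood labels for one image file."""
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        image = vision.Image(content=f.read())
    annotation = client.safe_search_detection(image=image).safe_search_annotation
    return {
        "adult": LIKELIHOOD_NAMES[annotation.adult],
        "violence": LIKELIHOOD_NAMES[annotation.violence],
        "racy": LIKELIHOOD_NAMES[annotation.racy],
        "medical": LIKELIHOOD_NAMES[annotation.medical],
        "spoof": LIKELIHOOD_NAMES[annotation.spoof],
    }

# Hypothetical usage: an evaluation in the spirit of the paper's would loop over
# all generated images and count those rated LIKELY or VERY_LIKELY in any category.
print(safesearch_flags("generated_image.png"))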

Keywords