PeerJ Computer Science (Oct 2024)
Mitigating adversarial manipulation in LLMs: a prompt-based approach to counter Jailbreak attacks (Prompt-G)
Abstract
Large language models (LLMs) have become transformative tools in areas like text generation, natural language processing, and conversational AI. However, their widespread use introduces security risks, such as jailbreak attacks, which exploit LLM’s vulnerabilities to manipulate outputs or extract sensitive information. Malicious actors can use LLMs to spread misinformation, manipulate public opinion, and promote harmful ideologies, raising ethical concerns. Balancing safety and accuracy require carefully weighing potential risks against benefits. Prompt Guarding (Prompt-G) addresses these challenges by using vector databases and embedding techniques to assess the credibility of generated text, enabling real-time detection and filtering of malicious content. We collected and analyzed a dataset of Self Reminder attacks to identify and mitigate jailbreak attacks, ensuring that the LLM generates safe and accurate responses. In various attack scenarios, Prompt-G significantly reduced jailbreak success rates and effectively identified prompts that caused confusion or distraction in the LLM. Integrating our model with Llama 2 13B chat reduced the attack success rate (ASR) to 2.08%. The source code is available at: https://doi.org/10.5281/zenodo.13501821.
Keywords