Mitigating adversarial manipulation in LLMs: a prompt-based approach to counter Jailbreak attacks (Prompt-G)

Bhagyajit Pingua; Deepak Murmu; Meenakshi Kandpal; Jyotirmayee Rautaray; Pranati Mishra; Rabindra Kumar Barik; Manob Jyoti Saikia

doi:10.7717/peerj-cs.2374

PeerJ Computer Science (Oct 2024)

Mitigating adversarial manipulation in LLMs: a prompt-based approach to counter Jailbreak attacks (Prompt-G)

Bhagyajit Pingua,
Deepak Murmu,
Meenakshi Kandpal,
Jyotirmayee Rautaray,
Pranati Mishra,
Rabindra Kumar Barik,
Manob Jyoti Saikia

Affiliations

Bhagyajit Pingua: School of Computer Sciences, Odisha University of Technology and Research, Bhubaneswar, Odisha, India
Deepak Murmu: School of Computer Sciences, Odisha University of Technology and Research, Bhubaneswar, Odisha, India
Meenakshi Kandpal: School of Computer Sciences, Odisha University of Technology and Research, Bhubaneswar, Odisha, India
Jyotirmayee Rautaray: School of Computer Sciences, Odisha University of Technology and Research, Bhubaneswar, Odisha, India
Pranati Mishra: School of Computer Sciences, Odisha University of Technology and Research, Bhubaneswar, Odisha, India
Rabindra Kumar Barik: School of Computer Applications, KIIT Deemed to be University, Bhubaneswar, Odisha, India
Manob Jyoti Saikia: Electrical and Computer Engineering, The University of Memphis, Memphis, TN, United States

DOI: https://doi.org/10.7717/peerj-cs.2374
Journal volume & issue: Vol. 10
p. e2374

Abstract

Read online Read online

Large language models (LLMs) have become transformative tools in areas like text generation, natural language processing, and conversational AI. However, their widespread use introduces security risks, such as jailbreak attacks, which exploit LLM’s vulnerabilities to manipulate outputs or extract sensitive information. Malicious actors can use LLMs to spread misinformation, manipulate public opinion, and promote harmful ideologies, raising ethical concerns. Balancing safety and accuracy require carefully weighing potential risks against benefits. Prompt Guarding (Prompt-G) addresses these challenges by using vector databases and embedding techniques to assess the credibility of generated text, enabling real-time detection and filtering of malicious content. We collected and analyzed a dataset of Self Reminder attacks to identify and mitigate jailbreak attacks, ensuring that the LLM generates safe and accurate responses. In various attack scenarios, Prompt-G significantly reduced jailbreak success rates and effectively identified prompts that caused confusion or distraction in the LLM. Integrating our model with Llama 2 13B chat reduced the attack success rate (ASR) to 2.08%. The source code is available at: https://doi.org/10.5281/zenodo.13501821.

Published in PeerJ Computer Science

ISSN: 2376-5992 (Online)
Publisher: PeerJ Inc.
Country of publisher: United States
LCC subjects: Science: Mathematics: Instruments and machines: Electronic computers. Computer science
Website: https://peerj.com/computer-science/

About the journal

Abstract

Keywords