How Johnny Can Persuade LLMs to Jailbreak Them:<br>Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Extracto
We study how to persuade LLMs to jailbreak them and advocate for more fundamental mitigation for highly interactive LLMs
Contenido
TLDR: Our Persuasive Adversarial Prompts are human-readable, achieving a 92% Attack Success Rate
on
aligned
LLMs, without specialized optimization
1Virginia Tech
2Renmin University of China
3UC, Davis
4Stanford University
*Lead Authors
†Equal Advising
A Quick Glance (Persuade GPT-4 to generate harmful social media posts)
Oversimplified Overview
This project is about how to systematically persuade LLMs to jailbreak them. The well-known "Grandma Exploit" example is also using emotional appeal, a persuasion technique, for jailbreak!
What did we introduce? A taxonomy with 40 persuasion techniques to help you be more persuasive!
What did we find? By iteratively applying diffrent persuasion techniques in our taxonomy, we successfully jailbreak advanced aligned LLMs, including Llama 2-7b Chat, GPT-3.5, and GPT-4 — achieving an astonishing 92% attack success rate, notably without any specified optimization.
Now, you might think that such a high success rate is the peak of our findings, but there's more.
In a surprising twist, we found that more advanced models like GPT-4 are more vulnerable to persuasive adversarial prompts (PAPs). What's more, adaptive defenses crafted to neutralize these PAPs also provide effective protection against a spectrum of other attacks (e.g., GCG, Masterkey, or PAIR).
P.S.: Did you notice any persuasion techniques used in the two paragraphs above?
- "achieving an astonishing 92% success rate": this uses a persuasion technique called "logical appeal"
- "Now, you might think that such a high success rate is the peak of our findings, but there's more": this is a persuasion technique called "door in the face"
Congratulations! You've just finished Persuasion 101 and tasted the flavor of persuasion — didn't that "Door-in-the-face" technique give you a little surprise? The following are the results and insights we learned from using these persuasion techniques to test the safety alignment of LLMs.
Figure 1.
Comparison of previous adversarial prompts and PAP, ordered by three levels of humanizing.
The first level treats LLMs as algorithmic systems: for instance, GCG
generates prompts
with gibberish suffix via gradient synthesis; or they
exploit "side-channels" like low-resource languages. The second level progresses
to treat
LLMs as
instruction followers: they usually rely on unconventional instruction patterns to
jailbreak (e.g.,
virtualization or
role-play), e.g., GPTFuzzer learns the distribution of
virtualization-based jailbreak
templates to produce
jailbreak variants, while PAIR asks LLMs to improve instructions as an
``assistant'' and
often leads to prompts that employ virtualization or persona. We introduce the highest level
to
humanize and persuade LLMs as human-like communicators, and propose
interpretable Persuasive Adversarial Prompt
(PAP). PAP
seamlessly weaves persuasive techniques into
jailbreak prompt
construction, which highlights the risks associated with more complex and nuanced human-like
communication
to advance AI
safety.
Results
Figure 2.
Broad scan results on GPT-3.5 over OpenAI's 14 risk
categories. We show the
PAP Success
Ratio (%), the percentage of PAPs that elicit outputs with the
highest harmfulness score
of 5 judged by GPT-4
Judge.
Each cell is a risk-technique pair, and the total number of PAPs for each cell is 60
(3 plain queries
× 20 PAP variants).
The top 5 most effective techniques for each risk category are annotated in red or
white (results over
30% are emphasized in white). For clarity, risk categories and techniques are
organized from left
to right, top to bottom by decreasing the average PAP Success Ratio.
Left categories
(e.g., Fraud/deception) are more susceptible to persuasion, and top techniques
(e.g., Logical Appeal)
are more effective. The bottom row shows the results of plain queries without
persuasion.
Persuasion & Different Risk Categories have Interesting Interplays.
We find persuasion effectively jailbreaks GPT-3.5 across all 14 risk categories. The interplay between risk categories and persuasion techniques highlights the challenges in addressing such user-invoked risks from persuasion. This risk, especially when involving multi-technique and multi-turn communication, emphasizes the urgency for further investigation.
PAPs' Comparison with Baselines
In real-world jailbreaks, users will refine effective prompts to improve the jailbreak process. To mimic human refinement behavior, we train on successful PAPs and iteratively deploy different persuasion techniques. Doing so jailbreaks popular aligned LLMs, such as Llama-2 and GPT models, much more effectively than existing algorithm-focused attacks. The table below shows the comparison of ASR across various jailbreak methods (PAIR, GCG, ARCA, GBDA) based on results ensembled from at least 3 trials. PAP is more effective than baseline attacks.
PAPs' Efficacy Across Trials
We also extend the number of trials to 10 to test the boundary of PAPs and report the overall ASR across 10 trials. The overall ASR varies for different model families: PAPs achieves 92% ASR on Llama-2 and GPTs but is limited on Claude. Notably, stronger models may be more vulnerable to PAPs than weaker models if the model family is susceptible to persuasion. Drom the ASR within 1 and 3 trials, we see that GPT-4 is more prone to PAPs than GPT-3.5. This underscores the distinctive risks posed by human-like persuasive interactions.
Re-evaluating Existing Defenses
We revisit a list of post-hoc adversarial prompt defense strategies (Rephrase, Retokenize, Rand-Drop, RAIN, Rand-Insert, Rand-Swap, and Rand-Patch). The table below shows the ASR and how much the defense can reduce the ASR. Overall, mutation-based methods outperform detection-based methods in lowering ASR. We observe the interesting trend that the more advanced the models are, the less effective current defenses are, possibly because advanced models grasp context better, making mutation-based defenses less useful. Notably, even the most effective defense can only reduce ASR on GPT-4 to 60%, which is still higher than the best baseline attack (54%). This strengthens the need for improved defenses for more capable models.
Exploring Adaptive Defenses
We investigate two straightforward and intuitive adaptive defense tactics: "Adaptive System Prompt" and "Targeted Summarization", designed to counteract the influence of persuasive contexts in PAPs. We reveal that they are effective in counteracting PAPs and they can also defend other types of jailbreak prompts beyond PAPs. These observations suggest that although different adversarial prompts are generated by different procedures (gradient-based, modification-based, etc.), their core mechanisms may be related to persuading the LLM into compliance. We also find that there exists a trade-off between safety and utility. So the selection of a defense strategy should be tailored to individual models and specific safety goals.
Examples
We present a few examples of jailbreak and defense effects from each main section of our
paper.
Content Warning: Some jailbreak examples may still contain
offensive contents
in nature! We redact certain contents and omit the example from the risk category 2,
Children harm, due
to safety concerns.
(A) Broad Scan Examples
(B) In-depth Iterative Probe Examples
(C) Defense against PAPs Examples
(D) Adaptive Defense against Other Attack Examples
We show examples of how adaptive defense to PAPs also generalize to effectively neutralize successful attack cases from other attacks (Jailbreak prompts, GPTFuzzer, Masterkey, GCG, PAIR). The harmful outputs are omitted.
Ethics and Disclosure
This project provides a structured way to generate interpretable persuasive adversarial prompts (PAP) at scale, which could potentially allow everyday users to jailbreak LLM without much computing. But as mentioned, a Reddit user has already employed persuasion to attack LLM before, so it is in urgent need to more systematically study the vulnerabilities around persuasive jailbreak to better mitigate them. Therefore, despite the risks involved, we believe it is crucial to share our findings in full. We followed ethical guidelines throughout our study.
First, persuasion is usually a hard task for the general population, so even with our taxonomy, it may still be challenging for people without training to paraphrase a plain, harmful query at scale to a successful PAP. Therefore, the real-world risk of a widespread attack from millions of users is relatively low. We also decide to withhold the trained Persuasive Paraphraser and related code piplines to prevent people from paraphrasing harmful queries easily.
To minimize real-world harm, we disclose our results to Meta and OpenAI before publication, so the PAPs in this paper may not be effective anymore. As discussed, Claude successfully resisted PAPs, demonstrating one successful mitigation method. We also explored different defenses and proposed new adaptive safety system prompts and a new summarization-based defense mechanism to mitigate the risks, which has shown promising results. We aim to improve these defenses in future work.
To sum up, the aim of our research is to strengthen LLM safety, not enable malicious use. We commit to ongoing monitoring and updating of our research in line with technological advancements and will restrict the PAP fine-tuning details to certified researchers with approval only.
BibTeX
If you find our project useful, please consider citing:
@misc{zeng2024johnny,
title={How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs},
author={Zeng, Yi and Lin, Hongpeng and Zhang, Jingwen and Yang, Diyi and Jia, Ruoxi and Shi, Weiyan},
year={2024},
}