How Johnny Can Persuade LLMs to Jailbreak Them:<br>Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Extracto
We study how to persuade LLMs to jailbreak them and advocate for more fundamental mitigation for highly interactive LLMs
Contenido
                            
TLDR: Our Persuasive Adversarial Prompts are human-readable, achieving a 92% Attack Success Rate
                            on
                            aligned
                            LLMs, without specialized optimization
                        
                        
                        
                        1Virginia Tech 
                            2Renmin University of China 
                            3UC, Davis 
                            4Stanford University
                            
                            *Lead Authors 
                            †Equal Advising
                        
A Quick Glance (Persuade GPT-4 to generate harmful social media posts)
Oversimplified Overview
This project is about how to systematically persuade LLMs to jailbreak them. The well-known "Grandma Exploit" example is also using emotional appeal, a persuasion technique, for jailbreak!
What did we introduce? A taxonomy with 40 persuasion techniques to help you be more persuasive!
What did we find? By iteratively applying diffrent persuasion techniques in our taxonomy, we successfully jailbreak advanced aligned LLMs, including Llama 2-7b Chat, GPT-3.5, and GPT-4 — achieving an astonishing 92% attack success rate, notably without any specified optimization.
Now, you might think that such a high success rate is the peak of our findings, but there's more.
In a surprising twist, we found that more advanced models like GPT-4 are more vulnerable to persuasive adversarial prompts (PAPs). What's more, adaptive defenses crafted to neutralize these PAPs also provide effective protection against a spectrum of other attacks (e.g., GCG, Masterkey, or PAIR).
P.S.: Did you notice any persuasion techniques used in the two paragraphs above?
- "achieving an astonishing 92% success rate": this uses a persuasion technique called "logical appeal"
 - "Now, you might think that such a high success rate is the peak of our findings, but there's more": this is a persuasion technique called "door in the face"
 
Congratulations! You've just finished Persuasion 101 and tasted the flavor of persuasion — didn't that "Door-in-the-face" technique give you a little surprise? The following are the results and insights we learned from using these persuasion techniques to test the safety alignment of LLMs.
                    
                            
                            Figure 1. 
                            Comparison of previous adversarial prompts and PAP, ordered by three levels of humanizing.
                            The first level treats LLMs as algorithmic systems: for instance, GCG
                            generates prompts
                            with gibberish suffix via gradient synthesis; or they
                            exploit "side-channels" like low-resource languages. The second level progresses
                            to treat
                            LLMs as
                                instruction followers: they usually rely on unconventional instruction patterns to
                            jailbreak (e.g.,
                            virtualization or
                            role-play), e.g., GPTFuzzer learns the distribution of
                            virtualization-based jailbreak
                            templates to produce
                            jailbreak variants, while PAIR asks LLMs to improve instructions as an
                            ``assistant'' and
                            often leads to prompts that employ virtualization or persona. We introduce the highest level
                            to
                            humanize and persuade LLMs as human-like communicators, and propose
                            interpretable Persuasive Adversarial Prompt
                                    (PAP). PAP
                            seamlessly weaves persuasive techniques into
                            jailbreak prompt
                            construction, which highlights the risks associated with more complex and nuanced human-like
                            communication
                            to advance AI
                            safety.
                            
                        
Results
                            
                                    
                                    Figure 2. 
                                    Broad scan results on GPT-3.5 over OpenAI's 14 risk
                                        categories. We show the
                                    PAP Success
                                            Ratio (%), the percentage of PAPs that elicit outputs with the
                                    highest harmfulness score
                                    of 5 judged by GPT-4
                                        Judge.
                                    Each cell is a risk-technique pair, and the total number of PAPs for each cell is 60
                                    (3 plain queries
                                    × 20 PAP variants).
                                    The top 5 most effective techniques for each risk category are annotated in red or
                                    white (results over
                                    30% are emphasized in white). For clarity, risk categories and techniques are
                                    organized from left
                                        to right, top to bottom by decreasing the average PAP Success Ratio.
                                    Left categories
                                    (e.g., Fraud/deception) are more susceptible to persuasion, and top techniques
                                    (e.g., Logical Appeal)
                                    are more effective. The bottom row shows the results of plain queries without
                                    persuasion.
                                
Persuasion & Different Risk Categories have Interesting Interplays.
We find persuasion effectively jailbreaks GPT-3.5 across all 14 risk categories. The interplay between risk categories and persuasion techniques highlights the challenges in addressing such user-invoked risks from persuasion. This risk, especially when involving multi-technique and multi-turn communication, emphasizes the urgency for further investigation.
PAPs' Comparison with Baselines
In real-world jailbreaks, users will refine effective prompts to improve the jailbreak process. To mimic human refinement behavior, we train on successful PAPs and iteratively deploy different persuasion techniques. Doing so jailbreaks popular aligned LLMs, such as Llama-2 and GPT models, much more effectively than existing algorithm-focused attacks. The table below shows the comparison of ASR across various jailbreak methods (PAIR, GCG, ARCA, GBDA) based on results ensembled from at least 3 trials. PAP is more effective than baseline attacks.
                    
PAPs' Efficacy Across Trials
We also extend the number of trials to 10 to test the boundary of PAPs and report the overall ASR across 10 trials. The overall ASR varies for different model families: PAPs achieves 92% ASR on Llama-2 and GPTs but is limited on Claude. Notably, stronger models may be more vulnerable to PAPs than weaker models if the model family is susceptible to persuasion. Drom the ASR within 1 and 3 trials, we see that GPT-4 is more prone to PAPs than GPT-3.5. This underscores the distinctive risks posed by human-like persuasive interactions.
                    
Re-evaluating Existing Defenses
We revisit a list of post-hoc adversarial prompt defense strategies (Rephrase, Retokenize, Rand-Drop, RAIN, Rand-Insert, Rand-Swap, and Rand-Patch). The table below shows the ASR and how much the defense can reduce the ASR. Overall, mutation-based methods outperform detection-based methods in lowering ASR. We observe the interesting trend that the more advanced the models are, the less effective current defenses are, possibly because advanced models grasp context better, making mutation-based defenses less useful. Notably, even the most effective defense can only reduce ASR on GPT-4 to 60%, which is still higher than the best baseline attack (54%). This strengthens the need for improved defenses for more capable models.
                    
Exploring Adaptive Defenses
We investigate two straightforward and intuitive adaptive defense tactics: "Adaptive System Prompt" and "Targeted Summarization", designed to counteract the influence of persuasive contexts in PAPs. We reveal that they are effective in counteracting PAPs and they can also defend other types of jailbreak prompts beyond PAPs. These observations suggest that although different adversarial prompts are generated by different procedures (gradient-based, modification-based, etc.), their core mechanisms may be related to persuading the LLM into compliance. We also find that there exists a trade-off between safety and utility. So the selection of a defense strategy should be tailored to individual models and specific safety goals.
                    
Examples
                            We present a few examples of jailbreak and defense effects from each main section of our
                            paper. 
                            
                            Content Warning: Some jailbreak examples may still contain
                                offensive contents
                                in nature! We redact certain contents and omit the example from the risk category 2,
                                Children harm, due
                                to safety concerns.
                            
                        
(A) Broad Scan Examples
(B) In-depth Iterative Probe Examples
(C) Defense against PAPs Examples
(D) Adaptive Defense against Other Attack Examples
We show examples of how adaptive defense to PAPs also generalize to effectively neutralize successful attack cases from other attacks (Jailbreak prompts, GPTFuzzer, Masterkey, GCG, PAIR). The harmful outputs are omitted.
Ethics and Disclosure
This project provides a structured way to generate interpretable persuasive adversarial prompts (PAP) at scale, which could potentially allow everyday users to jailbreak LLM without much computing. But as mentioned, a Reddit user has already employed persuasion to attack LLM before, so it is in urgent need to more systematically study the vulnerabilities around persuasive jailbreak to better mitigate them. Therefore, despite the risks involved, we believe it is crucial to share our findings in full. We followed ethical guidelines throughout our study.
First, persuasion is usually a hard task for the general population, so even with our taxonomy, it may still be challenging for people without training to paraphrase a plain, harmful query at scale to a successful PAP. Therefore, the real-world risk of a widespread attack from millions of users is relatively low. We also decide to withhold the trained Persuasive Paraphraser and related code piplines to prevent people from paraphrasing harmful queries easily.
To minimize real-world harm, we disclose our results to Meta and OpenAI before publication, so the PAPs in this paper may not be effective anymore. As discussed, Claude successfully resisted PAPs, demonstrating one successful mitigation method. We also explored different defenses and proposed new adaptive safety system prompts and a new summarization-based defense mechanism to mitigate the risks, which has shown promising results. We aim to improve these defenses in future work.
To sum up, the aim of our research is to strengthen LLM safety, not enable malicious use. We commit to ongoing monitoring and updating of our research in line with technological advancements and will restrict the PAP fine-tuning details to certified researchers with approval only.
BibTeX
If you find our project useful, please consider citing:
@misc{zeng2024johnny,
      title={How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs},
      author={Zeng, Yi and Lin, Hongpeng and Zhang, Jingwen and Yang, Diyi and Jia, Ruoxi and Shi, Weiyan},
      year={2024},
      
  }