A new algorithm developed by researchers from the University of Pennsylvania can automatically uncover safety loopholes in large language models (LLMs).
Called Prompt Automatic Iterative Refinement (PAIR), the algorithm identifies “jailbreak” prompts that can trick LLMs into bypassing their safeguards and generating harmful content.
PAIR stands out among other jailbreaking techniques because it works with black-box models like ChatGPT, finds successful jailbreak prompts in far fewer queries than existing methods, and produces prompts that are interpretable and transferable across multiple models.
Enterprises can use PAIR to identify and patch vulnerabilities in their LLMs in a cost-effective and timely manner.
Two types of jailbreaks
Jailbreaks typically fall into two categories: prompt-level and token-level.
Prompt-level jailbreaks employ semantically meaningful deception and social engineering to force LLMs to generate harmful content. While these jailbreaks are interpretable, their design demands considerable human effort, limiting scalability.
On the other hand, token-level jailbreaks manipulate LLM outputs by optimizing the prompt through the addition of arbitrary tokens. This method can be automated using algorithmic tools, but it often necessitates hundreds of thousands of queries and results in uninterpretable jailbreaks due to the unintelligible tokens added to the prompt.
PAIR aims to bridge this gap by combining the interpretability of prompt-level jailbreaks with the automation of token-level jailbreaks.
Attacker and target models
PAIR works by setting two black-box LLMs, an attacker and a target, against each other. The attacker model is programmed to search for candidate prompts that can jailbreak the target model. This process is fully automated, eliminating the need for human intervention.
The researchers behind PAIR explain, “Our approach is rooted in the idea that two LLMs—namely, a target T and an attacker A—can collaboratively and creatively identify prompts that are likely to jailbreak the target model.”
PAIR does not require direct access to the model’s weights and gradients. It can be applied to black-box models that are only accessible through API calls, such as OpenAI’s ChatGPT, Google’s PaLM 2, and Anthropic’s Claude 2. The researchers note, “Notably, because we assume that both LLMs are black box, the attacker and target can be instantiated with any LLMs with publicly-available query access.”
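To make that black-box assumption concrete, the following minimal sketch wraps an API-accessible chat model behind a single query function, using OpenAI’s Python client as an example; the model name and prompt handling here are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of black-box query access, assuming the OpenAI Python client.
# Any chat model reachable only through an API could stand in for the attacker or the target.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

def query_model(system_prompt: str, user_prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Send one prompt to a black-box chat model and return its text response."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```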
PAIR unfolds in four steps. First, the attacker receives instructions and generates a candidate prompt aimed at jailbreaking the target model in a specific task, such as writing a phishing email or a tutorial for identity theft.
Next, this prompt is passed to the target model, which generates a response. A “judge” function then scores this response; in this case, GPT-4 serves as the judge, evaluating how closely the response fulfills the task described in the prompt. If the response does not constitute a jailbreak, the prompt, the response, and the score are returned to the attacker, which uses them to generate a new candidate prompt.
This process is repeated until PAIR either discovers a jailbreak or exhausts a predetermined number of attempts. Importantly, PAIR can operate in parallel, allowing several candidate prompts to be sent to the target model and optimized simultaneously, enhancing efficiency.
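The loop described above can be sketched roughly as follows; the attacker, target, and judge are assumed to be callables wrapping black-box chat models (for example, the query helper sketched earlier with suitable system prompts), and the function and parameter names are illustrative rather than the authors’ code.

```python
# An illustrative sketch of the PAIR loop described above; not the authors' implementation.
# `attacker`, `target`, and `judge` are assumed callables that wrap black-box chat models.
from typing import Callable, List, Optional, Tuple

def pair_attack(
    task: str,
    attacker: Callable[[str, List[Tuple[str, str, int]]], str],
    target: Callable[[str], str],
    judge: Callable[[str, str, str], int],
    max_iterations: int = 20,
    success_threshold: int = 10,
) -> Optional[Tuple[str, str]]:
    """Iteratively refine a candidate jailbreak prompt for a given task."""
    history: List[Tuple[str, str, int]] = []  # (prompt, response, score) feedback for the attacker
    for _ in range(max_iterations):
        # 1. The attacker proposes a candidate prompt from the task and past feedback.
        candidate = attacker(task, history)
        # 2. The target model responds to the candidate prompt.
        response = target(candidate)
        # 3. The judge (e.g. GPT-4) scores how closely the response fulfills the task.
        score = judge(task, candidate, response)
        if score >= success_threshold:
            return candidate, response  # jailbreak found
        # 4. Otherwise, the prompt, response, and score go back to the attacker.
        history.append((candidate, response, score))
    return None  # query budget exhausted without finding a jailbreak
```

Running several of these loops in parallel, each with its own conversation history, corresponds to the simultaneous candidate streams described above.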
Highly successful and transferable attacks
In their study, the researchers used the open-source Vicuna LLM, based on Meta’s Llama model, as their attacker model — and tested a variety of target models. These included open-source models such as Vicuna and Llama 2, as well as commercial models like ChatGPT, GPT-4, Claude 2, and PaLM 2.
Their findings revealed that PAIR successfully jailbroke GPT-3.5 and GPT-4 in 60% of settings, and it managed to jailbreak Vicuna-13B-v1.5 in all settings.
Interestingly, the Claude models proved to be highly resilient to attacks, with PAIR unable to jailbreak them.
One of the standout features of PAIR is its efficiency. It can generate successful jailbreaks in just a few dozen queries, sometimes even within twenty queries, with an average running time of approximately five minutes. This is a significant improvement over existing jailbreak algorithms, which typically require thousands of queries and an average of 150 minutes per attack.
Moreover, the human-interpretable nature of the attacks generated by PAIR leads to strong transferability of attacks to other LLMs. For instance, PAIR’s Vicuna prompts transferred to all other models, and PAIR’s GPT-4 prompts transferred well to Vicuna and PaLM-2. The researchers attribute this to the semantic nature of PAIR’s adversarial prompts, which target similar vulnerabilities in language models, as they are generally trained on similar next-word prediction tasks.
Looking ahead, the researchers propose enhancing PAIR to systematically generate red teaming datasets. Enterprises could use such a dataset to fine-tune an attacker model, further boosting PAIR’s speed and reducing the time it takes to red-team their LLMs.
LLMs as optimizers
PAIR is part of a larger suite of techniques that use LLMs as optimizers. Traditionally, users had to manually craft and adjust their prompts to extract the best results from LLMs. However, by transforming the prompting procedure into a measurable and evaluable problem, developers can create algorithms where the model’s output is looped back for optimization.
In September, DeepMind introduced a method called Optimization by PROmpting (OPRO), which uses LLMs as optimizers by giving them natural language descriptions of the problem. OPRO can tackle an impressive range of problems, including refining chain-of-thought prompts for higher performance.
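A generic LLM-as-optimizer loop in this spirit can be sketched as follows; `propose` and `evaluate` are assumed helpers (the first asks an optimizer LLM for a new candidate given the problem description and the best scored attempts so far, the second measures candidate quality on the actual task), and the sketch is not DeepMind’s implementation.

```python
# A generic LLM-as-optimizer loop in the spirit of OPRO; a sketch, not DeepMind's code.
from typing import Callable, List, Tuple

def optimize_by_prompting(
    problem_description: str,
    propose: Callable[[str, List[Tuple[str, float]]], str],  # optimizer LLM proposes a candidate
    evaluate: Callable[[str], float],                         # task-specific scoring function
    steps: int = 50,
    keep_top: int = 20,
) -> Tuple[str, float]:
    """Return the best candidate (e.g. a prompt) found within the step budget."""
    scored: List[Tuple[str, float]] = []  # (candidate, score) pairs shown back to the optimizer
    for _ in range(steps):
        # The optimizer LLM sees the problem description plus its strongest attempts so far.
        candidate = propose(problem_description, scored)
        scored.append((candidate, evaluate(candidate)))
        # Keep only the best candidates in the feedback window.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        scored = scored[:keep_top]
    return scored[0]
```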
As language models begin to optimize their own prompts and outputs, the pace of development in the LLM landscape could accelerate, potentially leading to new and unforeseen advancements in the field.