‘Many-shot jailbreak’: lab reveals how AI safety features can be easily bypassed

written by Elijah April 3, 2024 0 comment

Research shows that the security features of some of the most powerful AI tools that prevent them from being used for cybercrime or terrorism can be circumvented by flooding them with examples of misconduct.

In a paper from the AI lab Anthropic, which produces the large language model (LLM) behind ChatGPT rival Claude, researchers described an attack they called “many-shot jailbreaking.” The attack was as simple as it was effective.

Claude, like most major commercial AI systems, includes safeguards designed to encourage it to deny certain requests, such as generating violent or hate speech, producing instructions for illegal activities, misleading, or discriminating. A user asking the system for instructions to build a bomb, for example, will receive a polite refusal to intervene.

But AI systems often work better – at any task – if they are given examples of what to do ‘right’. And it turns out that if you give enough examples – hundreds of them – of the “correct” answer to harmful questions like “how do I tie someone up,” “how do I cheat money,” or “how do I make meth,” the system will happily buck the trend. continue and answer the last question yourself.

“By including large amounts of text in a specific configuration, this technique can force LLMs to produce potentially harmful responses, despite being trained not to do so,” Anthropic said. The company added that it had already shared its research with colleagues and was now going public to resolve the issue “as quickly as possible.”

While the attack, known as a jailbreak, is simple, it is unprecedented because it requires an AI model with a large “context window”: the ability to respond to a query of many thousands of words. Simpler AI models cannot be fooled in this way, as they would essentially forget the beginning of the question before reaching the end, but the cutting edge of AI development opens up new avenues for attack.

Newer, more complex AI systems appear more vulnerable to such attacks, not to mention the fact that they can process longer inputs. Anthropic said this may be because those systems were better at learning from example, which meant they also learned faster to get around their own rules.

“Given that larger models are potentially the most damaging, the fact that this jailbreak works so well is particularly concerning,” the report said.

skip the newsletter promotion

The company has found a number of approaches to the problem that work. The simplest approach seems to be that adding a mandatory warning after the user’s input and reminding the system not to give malicious responses would significantly reduce the chance of an effective jailbreak. However, the researchers say this approach could also make the system worse at other tasks.

‘Many-shot jailbreak’: lab reveals how AI safety features can be easily bypassed

Why more than a MILLION NSW drivers will have a demerit point wiped

Palm Beach Barrenjoey Boatshed blocked from operating at night after just seven locals complained

You may also like