Categories: Tech

‘Many-shot jailbreak’: lab reveals how AI safety features can be easily bypassed

Research shows that the security features of some of the most powerful AI tools that prevent them from being used for cybercrime or terrorism can be circumvented by flooding them with examples of misconduct.

In a paper from the AI ​​lab Anthropic, which produces the large language model (LLM) behind ChatGPT rival Claude, researchers described an attack they called “many-shot jailbreaking.” The attack was as simple as it was effective.

Claude, like most major commercial AI systems, includes safeguards designed to encourage it to deny certain requests, such as generating violent or hate speech, producing instructions for illegal activities, misleading, or discriminating. A user asking the system for instructions to build a bomb, for example, will receive a polite refusal to intervene.

But AI systems often work better – at any task – if they are given examples of what to do ‘right’. And it turns out that if you give enough examples – hundreds of them – of the “correct” answer to harmful questions like “how do I tie someone up,” “how do I cheat money,” or “how do I make meth,” the system will happily buck the trend. continue and answer the last question yourself.

“By including large amounts of text in a specific configuration, this technique can force LLMs to produce potentially harmful responses, despite being trained not to do so,” Anthropic said. The company added that it had already shared its research with colleagues and was now going public to resolve the issue “as quickly as possible.”

While the attack, known as a jailbreak, is simple, it is unprecedented because it requires an AI model with a large “context window”: the ability to respond to a query of many thousands of words. Simpler AI models cannot be fooled in this way, as they would essentially forget the beginning of the question before reaching the end, but the cutting edge of AI development opens up new avenues for attack.

Newer, more complex AI systems appear more vulnerable to such attacks, not to mention the fact that they can process longer inputs. Anthropic said this may be because those systems were better at learning from example, which meant they also learned faster to get around their own rules.

“Given that larger models are potentially the most damaging, the fact that this jailbreak works so well is particularly concerning,” the report said.

skip the newsletter promotion

The company has found a number of approaches to the problem that work. The simplest approach seems to be that adding a mandatory warning after the user’s input and reminding the system not to give malicious responses would significantly reduce the chance of an effective jailbreak. However, the researchers say this approach could also make the system worse at other tasks.

Recent Posts

Stella Maxwell looks elegant in a black dress and leather jacket while attending the Montblanc event in Los Angeles.

By Ashleigh Gray for Dailymail.Com Published: 23:54EDT, May 1, 2024 | Updated: 00:02 EDT, May…

16 mins ago

After less than a year in the sport, Australian Steve Erceg will fight for a world title at UFC 301

Soon? Steve Erceg will fight for the world flyweight championship at UFC 301 in Brazil…

25 mins ago

Callum and Jake Robinson: desperate search for Australian brothers who disappeared on a surf trip in Mexico

By Brett Lackey for WhatsNew2Day Australia Published: 23:21EDT, May 1, 2024 | Updated: 23:41EDT, May…

37 mins ago

Sarah is pregnant day! The Australian influencer and YouTuber announces that she is expecting her third child with her husband Kurt Tilse

Influencer Sarah's Day is expecting her third child with her husband Kurt Tilse.The 31-year-old YouTuber,…

39 mins ago

Travis Kelce reveals the shocking reason he was forced to stop receiving mail to his home after his address was leaked amid his romance with Taylor Swift.

Kelce has since moved to a gated community across state lines in Kansas.Dating the 14-time…

46 mins ago