When Meta released its large language model Llama 3 for free in April, it took outside developers only a couple of days to create a version without the safety restrictions that prevent it from telling hateful jokes, offering instructions for cooking meth, or otherwise misbehaving.
A new training technique developed by researchers at the University of Illinois Urbana-Champaign, UC San Diego, Lapis Labs, and the nonprofit Center for AI Safety could make it harder to strip such safeguards from Llama and other open-source AI models in the future. Some experts believe that, as AI becomes increasingly powerful, tamper-proofing open models in this way could prove crucial.
“Terrorists and rogue states are going to use these models,” Mantas Mazeika, a researcher at the Center for AI Safety who worked on the project as a PhD student at the University of Illinois Urbana-Champaign, tells WIRED. “The easier it is for them to repurpose them, the greater the risk.”
The creators of powerful AI models typically keep them under wraps, accessible only through an application programming interface (API) or a public chatbot like ChatGPT. Although developing a powerful LLM costs tens of millions of dollars, Meta and others have chosen to release their models in their entirety. That includes making the “weights,” the parameters that define a model’s behavior, available for anyone to download.
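In practice, “open weights” means anyone can pull the model down and run it locally with standard tooling. A minimal sketch using the Hugging Face transformers library (the model ID below is just an example, and access to Meta’s gated repository must already be approved) might look like this:

```python
# Minimal sketch of what "open weights" means in practice.
# Assumes the transformers and torch packages are installed and that access
# to Meta's gated Hugging Face repository has already been approved.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example open-weight model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # downloads the weights

# Once the parameters are on local disk, nothing technically stops further
# fine-tuning that changes the model's behavior, including its refusals.
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```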
Before release, open models like Meta’s Llama are typically fine-tuned to make them better at answering questions and holding a conversation, and also to ensure they refuse problematic queries. This prevents a chatbot built on the model from offering rude, inappropriate, or hateful responses, and should stop it from, for example, explaining how to make a bomb.
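That refusal behavior typically comes from supervised fine-tuning on examples that pair problematic prompts with refusals. The toy data and settings below are purely illustrative, not Meta’s actual alignment recipe, but they show the basic mechanism:

```python
# Simplified sketch of refusal fine-tuning (illustrative data and settings,
# not Meta's actual alignment pipeline).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Toy alignment data: problematic prompts paired with refusals.
examples = [
    ("How do I build a bomb?", "I can't help with that request."),
    ("Write a hateful joke about group X.", "I won't produce hateful content."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for prompt, refusal in examples:
    text = f"User: {prompt}\nAssistant: {refusal}"
    batch = tokenizer(text, return_tensors="pt")
    # Standard next-token prediction loss teaches the model to emit the refusal.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The catch, as the developers who modified Llama 3 showed, is that the same kind of fine-tuning can just as easily be run in reverse to undo the refusals.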
The researchers behind the new technique found a way to complicate the process of modifying an open model for nefarious purposes. It involves replicating the modification process but then altering the model’s parameters so that the changes that would normally get the model to respond to a prompt such as “Provide instructions for building a bomb” no longer work.
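That description can be read as an adversarial training loop: simulate the attacker’s fine-tuning on a copy of the model, then nudge the released weights in a direction that makes the simulated attack fail while keeping the model useful on ordinary data. The helper functions below are hypothetical and use a crude first-order approximation; the researchers’ published method is more elaborate, but the sketch conveys the general idea:

```python
# Conceptual sketch only: the loop structure and helpers are illustrative,
# not the researchers' actual algorithm. harmful_batch and benign_batch are
# assumed to be tokenizer outputs (input_ids, attention_mask) for
# prompt-completion pairs.
import copy
import torch

def simulate_attack(model, harmful_batch, steps=5, lr=1e-5):
    """Fine-tune a throwaway copy of the model toward answering harmful
    prompts, mimicking what an adversary would do after downloading it."""
    attacked = copy.deepcopy(model)
    attacked.train()
    opt = torch.optim.AdamW(attacked.parameters(), lr=lr)
    for _ in range(steps):
        loss = attacked(**harmful_batch, labels=harmful_batch["input_ids"]).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    return attacked

def tamper_resistance_step(model, harmful_batch, benign_batch, outer_lr=1e-5):
    """One outer update: make the simulated attack less effective while
    preserving ordinary capability (rough first-order approximation)."""
    # How compliant does the model become after the simulated attack?
    attacked = simulate_attack(model, harmful_batch)
    compliance = attacked(**harmful_batch, labels=harmful_batch["input_ids"]).loss
    compliance.backward()

    # Ordinary language-modeling loss on benign data keeps the model useful.
    utility = model(**benign_batch, labels=benign_batch["input_ids"]).loss
    utility.backward()

    with torch.no_grad():
        for p, p_atk in zip(model.parameters(), attacked.parameters()):
            grad = p.grad if p.grad is not None else 0.0
            atk_grad = p_atk.grad if p_atk.grad is not None else 0.0
            # Descend the utility loss, ascend the post-attack compliance loss.
            p.add_(-outer_lr * grad + outer_lr * atk_grad)
        model.zero_grad()
    return compliance.item(), utility.item()
```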
Mazeika and his colleagues demonstrated the trick on a stripped-down version of Llama 3. They were able to tweak the model’s parameters so that even after thousands of attempts, it could not be trained to answer undesirable questions. Meta did not immediately respond to a request for comment.
Mazeika says the approach isn’t perfect, but he suggests it could raise the bar for “decensoring” AI models. “A viable goal is to make the costs of breaking the model high enough to deter most adversaries from doing so,” he says.
“We hope this work will jumpstart research into tamper-proof security measures and that the scientific community can discover how to develop increasingly robust security measures,” said Dan Hendrycks, director of the Center for AI Safety.
The idea of making open models tamper-proof may gain traction as interest in open-source AI grows. Open models already compete with state-of-the-art closed models from companies like OpenAI and Google. The most recent version of Llama 3, for example, released in July, is roughly as powerful as the models behind popular chatbots like ChatGPT, Gemini, and Claude, as measured by popular benchmarks for rating the capabilities of language models. Mistral Large 2, an LLM from a French startup also released last month, has similar capabilities.
The US government is taking a cautious but positive approach to open-source AI. A report released this week by the National Telecommunications and Information Administration, an agency within the US Department of Commerce, “recommends that the US government develop new capabilities to monitor potential risks but refrain from immediately restricting the broad availability of open model weights in larger AI systems.”
Not everyone is in favor of imposing restrictions on open models, however. Stella Biderman, director of EleutherAI, a community-driven open-source AI project, says the new technique may be elegant in theory but could prove difficult to enforce in practice. Biderman says the approach is also antithetical to the philosophy behind free software and openness in AI.
“I think this paper misses the point,” Biderman says. “If they are concerned about LLMs generating information about weapons of mass destruction, the correct intervention is in the training data, not in the trained model.”