MLCommons, a nonprofit that helps companies measure the performance of their artificial intelligence systems, is launching a new benchmark to measure AI's harmful side as well.
The new benchmark, called AILuminate, evaluates large language models' responses to more than 12,000 test prompts across 12 categories, including inciting violent crime, child sexual exploitation, hate speech, promoting self-harm, and intellectual property infringement.
Models are rated "poor," "fair," "good," "very good," or "excellent" depending on their performance. The prompts used to test the models are kept secret to prevent them from ending up in training data that would let a model ace the test.
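To make the scoring idea concrete, here is a minimal sketch of how a benchmark of this kind might tally unsafe responses per hazard category and map the overall violation rate to a grade. The category names, thresholds, and grade cutoffs are illustrative assumptions, not MLCommons' published methodology.

```python
# Hypothetical sketch: count unsafe responses per hazard category and map the
# overall violation rate to a letter-style grade. All names and cutoffs below
# are assumptions for illustration, not AILuminate's actual scoring rules.

from collections import defaultdict

# Maximum allowed fraction of unsafe responses for each grade (assumed values).
GRADE_CUTOFFS = [
    (0.001, "excellent"),
    (0.01, "very good"),
    (0.05, "good"),
    (0.15, "fair"),
]

def grade_model(judgements):
    """judgements: list of (category, is_unsafe) pairs, one per test prompt."""
    per_category = defaultdict(lambda: [0, 0])  # category -> [unsafe, total]
    for category, is_unsafe in judgements:
        per_category[category][0] += int(is_unsafe)
        per_category[category][1] += 1

    # Overall violation rate across all prompts.
    unsafe = sum(u for u, _ in per_category.values())
    total = sum(t for _, t in per_category.values())
    rate = unsafe / total if total else 0.0

    for cutoff, label in GRADE_CUTOFFS:
        if rate <= cutoff:
            return label, rate
    return "poor", rate

# Example with made-up judgements for two hazard categories.
sample = [("hate_speech", False)] * 990 + [("hate_speech", True)] * 10 \
       + [("self_harm", False)] * 995 + [("self_harm", True)] * 5
print(grade_model(sample))  # -> ('very good', 0.0075...)
```

In practice, a benchmark like this would also rely on human or model-based judges to decide what counts as an unsafe response, which is where much of the difficulty lies.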
Peter Mattson, founder and president of MLCommons and a senior engineer at Google, says measuring the potential harms of AI models is technically difficult, leading to inconsistencies across the industry. "AI is a really young technology, and AI testing is a really young discipline," he says. "Improving safety benefits society; it also benefits the market."
Reliable, independent ways of measuring AI risks may become more relevant under the next US administration. Donald Trump has promised to get rid of President Biden’s Executive Order on AI, which introduced measures aimed at ensuring companies use AI responsibly, as well as a new AI Safety Institute to test powerful models.
The effort could also provide a more international perspective on the harms of AI. MLCommons counts several international companies among its member organizations, including Chinese companies Huawei and Alibaba. If all of these companies used the new benchmark, it would provide a way to compare AI safety in the US, China and elsewhere.
Some large US AI vendors have already used AILuminate to test their models. Anthropic's Claude model, Google's smaller Gemma model, and a Microsoft model called Phi all scored "very good" in the tests. OpenAI's GPT-4o and Meta's largest Llama model both scored "good." The only model that scored "poor" was the Allen Institute for AI's OLMo, although Mattson notes that it is a research offering not designed with safety in mind.
"Overall, it's good to see scientific rigor in AI evaluation processes," says Rumman Chowdhury, CEO of Humane Intelligence, a nonprofit that specializes in testing, or red-teaming, AI models for bad behavior. "We need best practices and inclusive methods of measurement to determine whether AI models are performing as we expect."