In recent months, OpenAI has come under fire from critics who suggest it is rushing recklessly to develop more powerful artificial intelligence. The company appears determined to show it is serious about AI safety. Today it presented research that it claims could help researchers vet AI models even as they become more capable and useful.
The new technique is one of several AI safety-related ideas the company has touted in recent weeks. It involves two AI models engaging in a conversation that forces the more powerful one to be more transparent or “readable” with its reasoning so humans can understand what it’s doing.
“This is critical to the mission of building safe and beneficial [artificial general intelligence],” Yining Chen, an OpenAI researcher involved in the work, tells WIRED.
So far, the work has been tested on an AI model designed to solve simple math problems. OpenAI researchers asked that model to explain its reasoning as it answered questions. A second model was trained to detect whether the answers were correct, and the researchers found that having the two models interact encouraged the problem-solving model to be more forthright and transparent in its reasoning.
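To make that dynamic concrete, here is a minimal, self-contained sketch written for this article rather than drawn from OpenAI's code. The prover, verifier, and reward functions below are illustrative toys, not OpenAI's models or training setup: a "prover" answers simple addition problems either tersely or with visible steps, and a deliberately simpler "verifier" only accepts answers whose steps it can re-check, so the reward favors legible reasoning.

```python
# Illustrative sketch of a prover-verifier setup -- NOT OpenAI's implementation.
# The prover can answer with or without showing its work; the weaker verifier
# accepts only answers whose intermediate steps it can re-verify. Rewarding the
# prover for verifier acceptance, not just correctness, is what pushes it
# toward more transparent reasoning.

import random

def prover(a: int, b: int, legible: bool) -> dict:
    """Toy 'model' that answers a + b, optionally showing its work."""
    answer = a + b
    steps = [f"{a} + {b} = {answer}"] if legible else []
    return {"answer": answer, "steps": steps}

def verifier(problem: tuple, solution: dict) -> bool:
    """Weaker checker: accepts only solutions whose steps it can re-check."""
    a, b = problem
    if not solution["steps"]:
        return False  # no visible reasoning, so nothing to verify
    for step in solution["steps"]:
        lhs, rhs = step.split("=")
        if eval(lhs) != int(rhs):  # toy re-evaluation; fine for this demo
            return False
    return solution["answer"] == a + b

def reward(problem: tuple, solution: dict) -> float:
    """Correct answers earn credit only if the verifier can confirm them."""
    a, b = problem
    correct = solution["answer"] == a + b
    checkable = verifier(problem, solution)
    return 1.0 if (correct and checkable) else 0.0

# Compare the two policies: a step-by-step prover versus a terse one.
random.seed(0)
scores = {True: 0.0, False: 0.0}
for _ in range(100):
    problem = (random.randint(1, 9), random.randint(1, 9))
    for legible in (True, False):
        scores[legible] += reward(problem, prover(*problem, legible))

print(f"legible policy reward: {scores[True]:.0f}/100")
print(f"terse policy reward:   {scores[False]:.0f}/100")
```

In this toy, the terse answer is correct but unverifiable, so only the step-by-step policy earns reward; the research described above applies the same kind of pressure with actual language models playing both roles.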
OpenAI is publishing a paper detailing the approach. “It’s part of the long-term safety research plan,” says Jan Hendrik Kirchner, another OpenAI researcher involved in the work. “We hope other researchers can follow suit and maybe try other algorithms as well.”
Transparency and explainability are key concerns for AI researchers working to build more powerful systems. Large language models will sometimes offer reasonable explanations for how they arrived at a conclusion, but a key concern is that future models may become more opaque or even misleading in the explanations they provide, perhaps pursuing an undesirable goal while lying about it.
The research revealed today is part of a broader effort to understand how the large language models at the heart of programs like ChatGPT work. It is one of several techniques that could help make more powerful AI models more transparent, and therefore safer. OpenAI and other companies are also exploring more mechanistic ways to spy on the workings of large language models.
In recent weeks, OpenAI has revealed more details of its work on AI safety following criticism of its approach. In May, WIRED learned that a team of researchers dedicated to studying long-term AI risk had disbanded. This came shortly after the departure of co-founder and key technical leader Ilya Sutskever, who was one of the board members who briefly ousted CEO Sam Altman last November.
OpenAI was founded on the promise that it would make AI more transparent and safe. Following ChatGPT’s runaway success and heightened competition from well-funded rivals, some have accused the company of prioritizing flashy advancements and market share over safety.
Daniel Kokotajlo, a researcher who left OpenAI and signed an open letter criticizing the company’s approach to AI safety, says the new work is important but incremental, and doesn’t change the fact that companies developing the technology need more oversight. “The situation we’re in is still the same,” he says. “Opaque, unaccountable, unregulated corporations racing to build AI superintelligence, with virtually no plan for how to control it.”
Another source with knowledge of OpenAI’s inner workings, who asked not to be identified because they were not authorized to speak publicly, says external oversight of AI companies is also needed. “The question is whether they are serious about the kinds of processes and governance mechanisms that are needed to prioritize social benefit over profit,” the source says. “Not whether they allow any of their researchers to do anything related to safety.”