How game theory can make AI more trustworthy

A much bigger challenge for AI researchers was the game of Diplomacy, a favorite of politicians like John F. Kennedy and Henry Kissinger. Instead of just two opponents, the game features seven players whose motives can be difficult to read. To win, a player must negotiate and forge cooperative agreements that anyone could violate at any time. Diplomacy is so complex that a group from Meta was pleased when, in 2022, its AI program Cicero achieved “human-level play” over the course of 40 games. While it didn’t beat the world champion, Cicero did well enough to place in the top 10 percent against human participants.

During the project, Jacob, a member of the Meta team, was struck by the fact that Cicero relied on a language model to generate its dialogue with other players. He sensed untapped potential. The team’s goal, he said, “was to build the best possible language model for playing this game.” But what if they instead focused on creating the best possible game to improve the performance of large language models?

Consensual interactions

In 2023, Jacob began pursuing that question at MIT, working with Yikang Shen, Gabriele Farina, and his advisor, Jacob Andreas, on what would become the consensus game. The central idea came from imagining a conversation between two people as a cooperative game, where success occurs when a listener understands what the speaker is trying to convey. In particular, the consensus game is designed to align the language model’s two systems: the generator, which handles generative questions, and the discriminator, which handles discriminative questions.

After a few months of stops and starts, the team developed this principle into a complete game. First, the generator receives a question, which can come from a human or from a pre-existing list. An example might be “Where was Barack Obama born?” The generator then gets some candidate responses, say Honolulu, Chicago, and Nairobi. Again, these options can come from a human, a list, or a search performed by the language model itself.

But before answering, the generator is also told whether to answer the question correctly or incorrectly, depending on the results of a fair coin toss.

If the coin lands heads, the machine tries to answer correctly. The generator sends the original question, along with its chosen response, to the discriminator. If the discriminator determines that the generator intentionally sent the correct response, they each get one point, as a kind of incentive.

If the coin lands tails, the generator sends what it thinks is the wrong answer. If the discriminator decides that it was deliberately given the wrong response, they both get a point again. The idea here is to encourage agreement. “It’s like teaching a dog a trick,” Jacob explained. “You give them a reward when they do the right thing.”
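The round described above can be sketched in a few lines of Python. This is only an illustration of the scoring rule as the article describes it, not the team's implementation: the names `play_round`, `toy_generator`, and `toy_discriminator` are hypothetical, and the two toy players simply hard-code a belief that Honolulu is the right answer.

```python
import random

def toy_generator(question, candidates, intent):
    # Hypothetical stand-in for the generator: it believes "Honolulu" is right.
    believed_answer = "Honolulu"
    if intent == "correct":
        return believed_answer
    # Told to answer incorrectly: pick any candidate it believes is wrong.
    return next(c for c in candidates if c != believed_answer)

def toy_discriminator(question, answer):
    # Hypothetical stand-in for the discriminator: it shares the same belief.
    return "correct" if answer == "Honolulu" else "incorrect"

def play_round(generator, discriminator, question, candidates, rng):
    # Fair coin toss: heads -> answer correctly, tails -> answer incorrectly.
    intent = "correct" if rng.random() < 0.5 else "incorrect"
    answer = generator(question, candidates, intent)
    # The discriminator sees only the question and the answer, never the coin.
    guess = discriminator(question, answer)
    # Both players earn a point only when the guess matches the coin toss.
    points = 1 if guess == intent else 0
    return intent, answer, guess, points

question = "Where was Barack Obama born?"
candidates = ["Honolulu", "Chicago", "Nairobi"]
rng = random.Random(0)  # seeded so repeated runs give the same coin tosses
for _ in range(4):
    print(play_round(toy_generator, toy_discriminator, question, candidates, rng))
```

Because the two toy players hold the same belief, they score on every round, whichever way the coin lands. Two players that disagree about the answer would miss points, which is the pressure toward agreement that the game relies on.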

The generator and discriminator also start with some initial “beliefs.” These take the form of a probability distribution over the different options. For example, the generator may believe, based on information it has gleaned from the Internet, that there is an 80 percent chance that Obama was born in Honolulu, a 10 percent chance that he was born in Chicago, a 5 percent chance that he was born in Nairobi, and a 5 percent chance of other places. The discriminator can start with a different distribution. While the two “players” are still rewarded for reaching an agreement, they are also docked points for straying too far from their original convictions. That arrangement encourages players to incorporate their knowledge of the world (again gleaned from the Internet) into their answers, which should make the model more accurate. Without something like this, they could agree on a totally wrong answer like Delhi, but still rack up points.
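To make the penalty concrete, here is a minimal sketch that takes the generator's initial belief as 80 percent Honolulu, 10 percent Chicago, 5 percent Nairobi, and 5 percent elsewhere. The article does not say how "straying too far" is measured, so the use of Kullback-Leibler divergence here is an assumption, as are the `penalty_weight` parameter and the function names.

```python
import math

def kl_divergence(p, q):
    # KL(p || q): how far distribution p has moved from reference q.
    # Assumes q[k] > 0 wherever p[k] > 0.
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)

def regularized_payoff(agreement_points, current, initial, penalty_weight=1.0):
    # Reward for agreeing, minus a penalty for drifting from the initial belief.
    # The KL penalty and its weight are illustrative assumptions.
    return agreement_points - penalty_weight * kl_divergence(current, initial)

# Initial belief from the example in the text.
initial = {"Honolulu": 0.80, "Chicago": 0.10, "Nairobi": 0.05, "Other": 0.05}

# A belief that has drifted toward an unsupported answer such as Delhi,
# which falls under "Other" in this toy setup.
drifted = {"Honolulu": 0.05, "Chicago": 0.05, "Nairobi": 0.05, "Other": 0.85}

print(regularized_payoff(1, initial, initial))  # no drift, so no penalty
print(regularized_payoff(1, drifted, initial))  # agreement point offset by the penalty
```

With this toy scoring, a pair of players that agree on Delhi still earns the agreement point but loses more than that to the penalty, so sticking close to the initial belief pays better.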
