AI is a black box. Anthropic discovered a way to look inward

Last year, the team began experimenting with a tiny model that uses a single layer of neurons. (Sophisticated LLMs have dozens of layers.) The hope was that in the simplest possible setting they could discover patterns that correspond to features. They ran countless experiments without success. “We tried a lot of things and nothing worked. It looked like a random pile of trash,” says Tom Henighan, a member of Anthropic’s technical staff. Then a run called “Johnny” (each experiment was assigned a random name) began associating neural patterns with the concepts that appeared in its outputs.

“Chris looked at it and said, ‘Shit, this looks fantastic,’” says Henighan, who was equally stunned. “I looked at it and thought, ‘Oh wow, wait, this is working?’”

Suddenly, the researchers could identify the features that a group of neurons encoded. They could look inside the black box. Henighan says he could identify the first five features he looked at. One group of neurons signified Russian text. Another was associated with mathematical functions in the Python programming language. And so on.

Once they demonstrated that they could identify features in the tiny model, the researchers set themselves the more complicated task of decoding a full-sized LLM in the wild. They used Claude Sonnet, the mid-power version of Anthropic’s three current models. That worked too. One feature that caught their attention was associated with the Golden Gate Bridge. They traced the set of neurons that, when activated together, indicated that Claude was “thinking” about the enormous structure linking San Francisco to Marin County. What’s more, when similar sets of neurons were activated, they evoked subjects adjacent to the Golden Gate Bridge: Alcatraz, California Governor Gavin Newsom, and the Hitchcock movie Vertigo, which is set in San Francisco. In total, the team identified millions of features, a kind of Rosetta Stone for decoding Claude’s neural network. Many of the features were related to safety, including “approaching someone for some ulterior motive,” “discussion of biological warfare,” and “evil plots to take over the world.”
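
Anthropic has described this feature-finding technique in its published interpretability research as dictionary learning, implemented with a sparse autoencoder trained on the model’s internal activations. The sketch below is a minimal toy illustration of that idea in PyTorch; the dimensions, the random stand-in data, and the training details are assumptions made for brevity, not Anthropic’s actual code.

```python
# Minimal toy sketch of a sparse autoencoder for finding "features" in
# activations. Illustrative assumptions throughout; not Anthropic's code.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activations -> feature strengths
        self.decoder = nn.Linear(n_features, d_model)  # feature strengths -> reconstruction

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # non-negative, mostly zero
        reconstruction = self.decoder(features)
        return features, reconstruction

d_model, n_features = 64, 512          # many more features than activation dimensions
sae = SparseAutoencoder(d_model, n_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                        # sparsity penalty: few features active per input

for step in range(200):
    batch = torch.randn(256, d_model)  # stand-in for activations captured from an LLM
    features, reconstruction = sae(batch)
    loss = ((reconstruction - batch) ** 2).mean() + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training on real activations, each learned feature is labeled by
# inspecting which inputs activate it most strongly (Russian text, Python
# functions, the Golden Gate Bridge, and so on).
```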

The Anthropic team then took the next step to see if they could use that information to change Claude’s behavior. They began manipulating the neural network to amplify or suppress certain concepts: a kind of AI brain surgery, with the potential to make LLMs safer and boost their power in selected areas. “Let’s say we have this board of features. We run the model, one of them lights up, and we see, ‘Oh, it’s thinking about the Golden Gate Bridge,’” says Shan Carter, an Anthropic scientist on the team. “So now we’re thinking, what would happen if we put a little dial on each of these? What if we turn that dial?”

So far, the answer to that question seems to be that it is crucial to turn the dial by just the right amount. The team found several features that represent dangerous behaviors, such as insecure computer code, scam emails, and instructions for making dangerous products. By suppressing those features, Anthropic says, the model can produce safer computer programs and show less bias.
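
Anthropic’s published account of this work describes “clamping” a feature’s activation up or down to steer the model; its Golden Gate Claude demo amplified the bridge feature. The snippet below is a minimal sketch of that dial, assuming a feature’s direction in activation space is already known (for instance, a decoder column from the sparse autoencoder above). The function name, tensors, and dial values are hypothetical illustrations, not Anthropic’s implementation.

```python
# Minimal sketch of "turning the dial" on a feature. Hypothetical names and
# values; assumes a known feature direction in activation space.
import torch

def steer(activations: torch.Tensor, feature_direction: torch.Tensor, dial: float) -> torch.Tensor:
    """Scale the component of `activations` along `feature_direction` by `dial`."""
    direction = feature_direction / feature_direction.norm()
    strength = activations @ direction             # how strongly the feature is present
    # dial > 1 amplifies the feature, 0 < dial < 1 weakens it, dial = 0 removes it
    return activations + (dial - 1.0) * strength.unsqueeze(-1) * direction

acts = torch.randn(4, 64)        # a batch of activation vectors (toy data)
golden_gate = torch.randn(64)    # stand-in for a learned "Golden Gate Bridge" direction
amplified = steer(acts, golden_gate, dial=5.0)   # make the model "think" about it more
suppressed = steer(acts, golden_gate, dial=0.0)  # remove the feature entirely
```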
