🤖 Science & Emerging Tech

New research shines light into AI’s “black box”

Thursday, May 30

Image: Violet Dashi/Nadia and Simple Line/Adobe Stock

Even though they were created by humans, artificial intelligence models still hold more mystery than the Knives Out franchise.

In the AI industry, this has been dubbed the “black box” problem – essentially, these models have a tendency to do things that aren’t easily explained by the people observing them (ex: hallucinations).

However, newly published research from Anthropic, one of the top companies in the AI industry, attempts to shed some light on that topic by breaking down how one of its AI models, Claude 3 Sonnet, actually functions.

How artificial intelligence works: AI systems are set up roughly like the human brain – layered neural networks that take in and process information, then make predictions (or “decisions”) based on that information.

On a granular level, what an AI model is “thinking” before it responds is represented by neuron activations: long strings of numbers with no clear meaning on their own.
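
To make that concrete, here is a minimal sketch of a tiny layered network in Python. It is a toy illustration only (nothing like Claude’s actual architecture or Anthropic’s code), but it shows how information flows through layers of “neurons” and how the numbers in between carry no obvious meaning.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer network. Real models like Claude have billions of weights;
# the basic flow (input -> layers of neurons -> prediction) is the same idea.
W1 = rng.normal(size=(8, 16))    # 8 input values feed 16 hidden "neurons"
W2 = rng.normal(size=(16, 4))    # 16 hidden neurons feed 4 output scores

def forward(x):
    hidden = np.maximum(0, x @ W1)   # neuron activations (after a ReLU nonlinearity)
    output = hidden @ W2             # scores used to make the final prediction
    return hidden, output

x = rng.normal(size=8)               # stand-in for an encoded piece of text
hidden, output = forward(x)

# The model's intermediate "thinking" is just a long string of numbers like this,
# with no human-readable meaning on its own.
print(hidden.round(2))
print("prediction:", int(output.argmax()))
```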

When zooming out, however, Anthropic’s researchers were able to match Claude’s patterns of neuron activations, called features, to human-interpretable concepts. They were also able to identify features “close” to each other – basically how strongly the model associates one concept with another.
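
That “closeness” can be pictured geometrically: each feature corresponds to a direction in the model’s activation space, and nearby features point in similar directions. Here is a minimal sketch of the idea in Python; the feature names and vectors are made-up stand-ins (Anthropic extracts the real features with a dictionary-learning step described in the paper), and cosine similarity is just one simple way to measure nearness.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up stand-ins for learned feature directions. The real ones come out of
# Anthropic's dictionary-learning step; these are constructed purely to illustrate.
golden_gate = rng.normal(size=64)
alcatraz    = 0.8 * golden_gate + 0.6 * rng.normal(size=64)  # built to overlap with the bridge
cold_fusion = rng.normal(size=64)                            # an unrelated concept

def cosine_similarity(a, b):
    """Ranges from -1 to 1; higher means the two directions are more alike."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Features whose directions are similar are "close": the model treats the
# underlying concepts as related.
print("Alcatraz vs. Golden Gate:   ", round(cosine_similarity(golden_gate, alcatraz), 2))
print("Cold fusion vs. Golden Gate:", round(cosine_similarity(golden_gate, cold_fusion), 2))
```

The bullet points below show the kinds of concepts the researchers actually found clustered near one such feature.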

  • For example: Near a "Golden Gate Bridge" feature, researchers found features for Alcatraz Island, Ghirardelli Square, the Golden State Warriors, California Governor Gavin Newsom, the 1906 earthquake, and the San Francisco-set Alfred Hitchcock film Vertigo.
  • The researchers were also able to manipulate these features to see how the model’s responses changed (a rough sketch of that kind of intervention follows just below this list). Amplifying the "Golden Gate Bridge" feature, for example, gave Claude a Freaky Friday-like identity crisis – when asked "what is your physical form?", the model answered: "I am the Golden Gate Bridge… my physical form is the iconic bridge itself." It also became "effectively obsessed" with the bridge, bringing it up in response to almost any query, even when it wasn’t relevant.
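
Roughly speaking, “amplifying” a feature means pushing the model’s internal activations further along that feature’s direction before later layers read them (the paper describes clamping the feature to a high value). The sketch below is a hedged illustration of that idea only; the vectors and the amplify helper are hypothetical stand-ins, not Anthropic’s code.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: a hidden-activation vector plus a hypothetical feature direction.
hidden = rng.normal(size=64)             # what the model is "thinking" mid-computation
bridge_feature = rng.normal(size=64)     # stand-in for a learned "Golden Gate Bridge" feature
bridge_feature /= np.linalg.norm(bridge_feature)

def amplify(activations, feature, strength):
    # Push the activations further along the feature's direction, mimicking the
    # "clamp this feature to a high value" intervention described in the paper.
    return activations + strength * feature

steered = amplify(hidden, bridge_feature, strength=10.0)

# Downstream layers now see activations dominated by the bridge feature, which is
# why the steered model starts working the bridge into unrelated answers.
print("bridge component before:", round(float(hidden @ bridge_feature), 2))
print("bridge component after: ", round(float(steered @ bridge_feature), 2))
```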

Bottom line: While a good bit of research is still needed to unpack how AI actually works, Anthropic’s paper is one of the deepest looks inside a production AI model to date – and one small step for man in that direction.

Read the full research paper.
