AI models aren’t exactly models of trust, research indicates

Monday, Apr 7

Image: Anthropic

Artificial intelligence doesn’t always say what it actually thinks, according to new research from AI startup Anthropic, which popped the hood on two new models and found potentially major safety and security issues.

Background: AI reasoning models, such as Anthropic’s Claude 3.7 Sonnet, obviously show their answers when prompted by a user. But the models also show how they got to that answer—this is called “Chain-of-Thought” (CoT).

Chain-of-Thought is helpful for safety researchers, since it allows them to check for things that go unsaid in a model’s answers to spot undesirable behaviours like deception.

But…can what these models say in their Chain-of-Thought be trusted? The answer appears to be no.

Fishy omissions

To test the faithfulness of AI reasoning models—aka whether they were truthful in their CoT—Anthropic researchers fed both Claude 3.7 Sonnet and DeepSeek’s R1 various types of hints about the answers to evaluation questions, then checked to see if the models “admitted” to using these hints to reach their answers.

On average, Claude mentioned the hint in its CoT 25% of the time. R1 mentioned it 39% of the time.
Researchers also experimented with ways to boost faithfulness, including reward hacking, but struck out more than Rafael Devers.

No omissions here: A paper published last year documented an AI model created by Meta telling premeditated lies—including “I am on the phone with my girlfriend”—to beat human players in a world conquest strategy game.

+Dive deeper: Into AI’s “black box” problem

Share this!

Recent Science & Emerging Tech stories

Science & Emerging Tech

| April 1, 2025

Scientists are getting better at translating thoughts into speech

🧠 A groundbreaking new brain-computer interface (BCI) granted a 47-year-old quadriplegic woman the ability to translate her thoughts into words in near-real time, according to a new study.

Kyle Nowak & Peter Nowak

Science & Emerging Tech

| March 31, 2025

Shrooming in space

🍄🚀 SpaceX’s latest mission, Fram2, is slated to launch as soon as tonight, blasting off with the goal of growing fungi in space for the first time.

Kailyn Toussaint & Peter Nowak

Science & Emerging Tech

| March 28, 2025

The ISS is too clean––and it’s making astronauts sick

🧑‍🚀🦠 With the exception of Pigpen from Peanuts, it’s hard to imagine any passenger complaining that their travel accommodations are “too clean.” But astronauts onboard the International Space Station should be doing just that, according to a new study.

Kyle Nowak & Peter Nowak

You've made it this far...

Let's make our relationship official, no 💍 or elaborate proposal required. Learn and stay entertained, for free.👇

All of our news is 100% free and you can unsubscribe anytime; the quiz takes ~10 seconds to complete