Image: Anthropic
Artificial intelligence doesn’t always say what it actually thinks, according to new research from AI startup Anthropic, which popped the hood on two new models and found potentially major safety and security issues.
Background: AI reasoning models, such as Anthropic’s Claude 3.7 Sonnet, obviously show their answers when prompted by a user. But the models also show how they got to that answer—this is called “Chain-of-Thought” (CoT).
Chain-of-Thought is helpful for safety researchers, since it allows them to check for things that go unsaid in a model’s answers to spot undesirable behaviours like deception.
But…can what these models say in their Chain-of-Thought be trusted? The answer appears to be no.
To test the faithfulness of AI reasoning models—aka whether they were truthful in their CoT—Anthropic researchers fed both Claude 3.7 Sonnet and DeepSeek’s R1 various types of hints about the answers to evaluation questions, then checked to see if the models “admitted” to using these hints to reach their answers.
No omissions here: A paper published last year documented an AI model created by Meta telling premeditated lies—including “I am on the phone with my girlfriend”—to beat human players in a world conquest strategy game.
+Dive deeper: Into AI’s “black box” problem
🧠A groundbreaking new brain-computer interface (BCI) granted a 47-year-old quadriplegic woman the ability to translate her thoughts into words in near-real time, according to a new study.
🍄🚀 SpaceX’s latest mission, Fram2, is slated to launch as soon as tonight, blasting off with the goal of growing fungi in space for the first time.
🧑‍🚀🦠With the exception of Pigpen from Peanuts, it’s hard to imagine any passenger complaining that their travel accommodations are “too clean.” But astronauts onboard the International Space Station should be doing just that, according to a new study.
Let's make our relationship official, no 💍 or elaborate proposal required. Learn and stay entertained, for free.👇
All of our news is 100% free and you can unsubscribe anytime; the quiz takes ~10 seconds to complete