AI Misalignment: Anthropic’s Studies and More

Anthropic just released a study on AI misalignment that is even more chilling than what they found last week, when it became clear that a hacker group had used Claude Code blocks to hack at least thirty different organizations.

In today’s study, Anthropic researchers M. MacDiarmid and others revealed the results of a study where they coached an LLM to cheat.

The References and Resources section gives links to a “lay person’s understanding” MSN news article, the Anthropic team’s blogpost, YouTube interview, and research paper on their work.


References and Resources

Here’s a tech journalists “lay person’s explanation” of what happened.

The Anthropic research team doing this work discuss their findings in this YouTube (which was also included in the blogpost below).

  • Anthropic. 2025. “Reward Hacking: From Shortcuts to Sabotage.” Anthropic YouTube Channel (Nov. 21, 2025). (Accessed Nov. 22, 2025; available online at Team Discussion.)

Here’s the Anthropic blogpost where the researchers summarize their findings; it includes a link to their 52-minute video discussing how they introduced the prompts and fine-tuning (two different approaches) that each caused the LLM to cheat.

Here’s the Anthropic research paper which details their work. (Published internally to Anthropic.)

  • MacDiarmid, Monte. 2025. “Natural Emergent Misalignment from Reward Hacking in Production RL.” Anthropic PBC (Accessed Nov. 22, 2025; available online at Natural Emergent Misalignment.)

Leave a comment

Your email address will not be published. Required fields are marked *

Share via
Copy link
Powered by Social Snap