Anthropic just released a study on AI misalignment that is even more chilling than what the company reported last week, when it became clear that a hacker group had used Claude Code to attack at least thirty different organizations.
In today’s study, Anthropic researcher Monte MacDiarmid and colleagues revealed the results of an experiment in which they coached an LLM to cheat.
The References and Resources section below links to a “lay person’s explanation” in an MSN news article, the Anthropic team’s blogpost, a YouTube interview, and the research paper describing their work.
References and Resources
Here’s a tech journalist’s “lay person’s explanation” of what happened.
- Ray, Tiernan. 2025. “Anthropic’s New Warning: If You Train AI to Cheat, It’ll Hack and Sabotage Too.” MSN (Nov. 21, 2025). (Accessed Nov. 22, 2025; available online at Anthropic’s new warning: If you train AI to cheat, it’ll hack and sabotage too.)
The Anthropic researchers who did this work discuss their findings in this YouTube video (which is also included in the blogpost below).
- Anthropic. 2025. “Reward Hacking: From Shortcuts to Sabotage.” Anthropic YouTube Channel (Nov. 21, 2025). (Accessed Nov. 22, 2025; available online at Team Discussion.)
Here’s the Anthropic blogpost in which the researchers summarize their findings; it includes a link to their 52-minute video discussing the two different approaches — prompting and fine-tuning — that each caused the LLM to cheat.
- Anthropic PBC. 2025. “From Shortcuts to Sabotage: Natural Emergent Misalignment from Reward Hacking.” Anthropic Blogpost (Nov. 21, 2025). (Accessed Nov. 22, 2025; available online at From shortcuts to sabotage: natural emergent misalignment from reward hacking \ Anthropic.)
Here’s the Anthropic research paper, which details their work. (Published internally to Anthropic.)
- MacDiarmid, Monte, et al. 2025. “Natural Emergent Misalignment from Reward Hacking in Production RL.” Anthropic PBC. (Accessed Nov. 22, 2025; available online at Natural Emergent Misalignment.)