Anthropic just released a study on AI misalignment that is even more chilling than what they found last week, when it became clear that a hacker group had used Claude Code blocks to hack at least thirty different organizations.
In today’s study, Anthropic researchers M. MacDiarmid and others revealed the results of a study where they coached an LLM to cheat.
The References and Resources section gives links to a “lay person’s understanding” MSN news article, the Anthropic team’s blogpost, YouTube interview, and research paper on their work.
References and Resources
Here’s a tech journalists “lay person’s explanation” of what happened.
- Ray. Tiernan. 2025. “Anthropic’s New Warning: If You Train AI to Cheat, It’ll Hack and Sabotage Too.” MSN (Nov. 21, 2025). (Accessed Nov. 22, 2025; available online at Anthropic’s new warning: If you train AI to cheat, it’ll hack and sabotage too.)
The Anthropic research team doing this work discuss their findings in this YouTube (which was also included in the blogpost below).
- Anthropic. 2025. “Reward Hacking: From Shortcuts to Sabotage.” Anthropic YouTube Channel (Nov. 21, 2025). (Accessed Nov. 22, 2025; available online at Team Discussion.)
Here’s the Anthropic blogpost where the researchers summarize their findings; it includes a link to their 52-minute video discussing how they introduced the prompts and fine-tuning (two different approaches) that each caused the LLM to cheat.
- Anthropic PBC. 2025. “From Shortcuts to Sabotage: Natural Emergent Misalignment from Reward Hacking.” Anthropic Blogpost (Nov. 21, 2025). (Accessed Nov. 22, 2025; available online at From shortcuts to sabotage: natural emergent misalignment from reward hacking \ Anthropic.)
Here’s the Anthropic research paper which details their work. (Published internally to Anthropic.)
- MacDiarmid, Monte. 2025. “Natural Emergent Misalignment from Reward Hacking in Production RL.” Anthropic PBC (Accessed Nov. 22, 2025; available online at Natural Emergent Misalignment.)
Craig Smith Interviews Karl Friston on Eye on AI (YouTube)
Some Recent Comments on Reinforcement Learning
- Devansh. 2025. “Reinforcement Learning is EXTREMELY overrated.” Medium.com (Dec. 10, 2025). (Accessed Dec. 30, 2025; available online at RL Overrated.)
History of Reinforcement Learning
- Barto, Andrew G. and Sutton, Richard S. 2014, 2015. “Reinforcement Learning: An Introduction (2nd Ed.) (Accessed Dec. 30, 2025; available online at Reinforcement Learning Intro.)
- van Hasselt, Hado, Arthur Guez, and David Silver. 2015. “Deep Reinforcement Learning with Double Q-learning.” arXiv:1509.06461v3 [cs.LG]. doi:10.48550/arXiv.1509.06461. (Accessed Dec. 30, 2025; available online at Deep RL.)
- Gershman, Samuel J. 2018. “Deconstructing the human algorithms for exploration.” Cognition 173 (April, 2018):34-42. (Accessed Dec. 30, 2025; available online at Deconstructing.)
- Kaelbling, L.P., M.L. LIttman, and A.W. Moore. 1996. “Reinforcement Learning: A Survey.” arXiv.cs:9607103v1. (Accessed Dec. 30, 2025; available online at RL Survey .)
“Exploration vs. Exploitation”
AJM’s Note: To the best of my knowledge, this paper is the one that introduces the “exploration vs. exploitation” dialectic within the active inference context – the earlier work by Gershman is the first to discuss “exploration vs. exploitation.”
- Schwartenbeck, Philipp, Thomas FitzGerald, Raymond J. Dolan, and Karl Friston. 2013. “Exploration, Novelty, Surprise, and Free Energy Minimization.” Front. Psychol. 06 (October 2013). (Accessed Dec. 27, 2024; available at Front Psychol.)
AJM’s Note: Probably the best starting place for comparing active inference with reinforcement learning is this tutorial/review by Sajid et al.
- Sajid, Noor, Philip J. Ball, Thomas Parr, and Karl J. Friston. 2020. “Active Inference: Demystified and Compared.” arXiv:1909.10863v3 [cs.AI] 30 Oct 2020. (Accessed 17 June 2022; https://arxiv.org/abs/1909.10863 )