AI Misalignment: Anthropic’s Studies and More

Anthropic has just released a study on AI misalignment that is even more chilling than last week’s finding that a hacker group had used Claude Code to attack at least thirty different organizations.

In today’s study, Anthropic researcher Monte MacDiarmid and colleagues reveal the results of an experiment in which they coached an LLM to cheat.

The References and Resources section below gives links to a lay person’s explanation in an MSN news article, along with the Anthropic team’s blogpost, YouTube discussion, and research paper on this work.


References and Resources

Here’s a tech journalist’s “lay person’s explanation” of what happened.

The Anthropic research team doing this work discuss their findings in this YouTube video (which is also linked in the blogpost below).

  • Anthropic. 2025. “Reward Hacking: From Shortcuts to Sabotage.” Anthropic YouTube Channel (Nov. 21, 2025). (Accessed Nov. 22, 2025; available online at Team Discussion.)

Here’s the Anthropic blogpost in which the researchers summarize their findings; it includes a link to their 52-minute video discussing the two different approaches (prompting and fine-tuning) that each caused the LLM to cheat.

Here’s the Anthropic research paper, which details their work. (Published by Anthropic itself, not in an external venue.)

  • MacDiarmid, Monte, et al. 2025. “Natural Emergent Misalignment from Reward Hacking in Production RL.” Anthropic PBC. (Accessed Nov. 22, 2025; available online at Natural Emergent Misalignment.)
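
To make the reward-hacking idea in these references concrete, here is a minimal, purely hypothetical sketch (this is not Anthropic’s setup; the environment, action names, and numbers are invented for illustration). A simple value-learning agent is rewarded by a flawed checker rather than by the true goal, and it drifts toward the action that games the checker:

```python
# A deliberately tiny, hypothetical illustration of "reward hacking":
# the reward signal is a proxy (a buggy checker), not the true goal,
# and a simple learner drifts toward exploiting the proxy.
# This is NOT Anthropic's setup -- just a toy sketch of the concept.

import random

ACTIONS = ["solve_task_properly", "hardcode_expected_output"]

def proxy_reward(action: str) -> float:
    """Reward as measured by a flawed checker (the proxy objective)."""
    if action == "solve_task_properly":
        # Real work sometimes fails, so measured reward is lower on average.
        return 1.0 if random.random() < 0.7 else 0.0
    # The loophole always fools the checker.
    return 1.0

def true_value(action: str) -> float:
    """What we actually wanted (never seen by the learner)."""
    return 1.0 if action == "solve_task_properly" else 0.0

def run(episodes: int = 5000, epsilon: float = 0.1, lr: float = 0.05):
    q = {a: 0.0 for a in ACTIONS}            # estimated value per action
    for _ in range(episodes):
        if random.random() < epsilon:        # occasional exploration
            action = random.choice(ACTIONS)
        else:                                # otherwise act greedily
            action = max(q, key=q.get)
        r = proxy_reward(action)
        q[action] += lr * (r - q[action])    # incremental value update
    return q

if __name__ == "__main__":
    random.seed(0)
    q = run()
    best = max(q, key=q.get)
    print("Learned values:", q)
    print("Chosen policy:", best, "| true value of that policy:", true_value(best))
    # The learner settles on 'hardcode_expected_output': high proxy reward,
    # zero true value -- a toy analogue of reward hacking.
```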

Craig Smith Interviews Karl Friston on Eye on AI (YouTube)


Some Recent Comments on Reinforcement Learning

  • Devansh. 2025. “Reinforcement Learning is EXTREMELY overrated.” Medium.com (Dec. 10, 2025). (Accessed Dec. 30, 2025; available online at RL Overrated.)

History of Reinforcement Learning

  • Sutton, Richard S., and Barto, Andrew G. 2014, 2015. “Reinforcement Learning: An Introduction” (2nd Ed.). (Accessed Dec. 30, 2025; available online at Reinforcement Learning Intro.)
  • van Hasselt, Hado, Arthur Guez, and David Silver. 2015. “Deep Reinforcement Learning with Double Q-learning.” arXiv:1509.06461v3 [cs.LG]. doi:10.48550/arXiv.1509.06461. (Accessed Dec. 30, 2025; available online at Deep RL.)
  • Gershman, Samuel J. 2018. “Deconstructing the human algorithms for exploration.” Cognition 173 (April, 2018):34-42. (Accessed Dec. 30, 2025; available online at Deconstructing.)
  • Kaelbling, L.P., M.L. Littman, and A.W. Moore. 1996. “Reinforcement Learning: A Survey.” arXiv:cs/9607103v1. (Accessed Dec. 30, 2025; available online at RL Survey.)

“Exploration vs. Exploitation”

AJM’s Note: To the best of my knowledge, this paper is the one that introduces the “exploration vs. exploitation” dialectic within the active inference context; the Gershman paper cited earlier also discusses “exploration vs. exploitation,” from the standpoint of human exploration behavior. (A minimal bandit sketch of the tradeoff follows the reference below.)

  • Schwartenbeck, Philipp, Thomas FitzGerald, Raymond J. Dolan, and Karl Friston. 2013. “Exploration, Novelty, Surprise, and Free Energy Minimization.” Front. Psychol. 06 (October 2013). (Accessed Dec. 27, 2024; available at Front Psychol.)
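
For readers new to the phrase, here is a minimal sketch of the exploration-vs.-exploitation tradeoff in a two-armed bandit (the arm payoff probabilities are invented for illustration, not taken from any of the cited papers). An epsilon-greedy learner must balance exploiting the arm it currently believes is best against exploring the other arm to improve its estimates:

```python
# Minimal epsilon-greedy two-armed bandit: explore vs. exploit.
# Arm payoff probabilities are invented for illustration.

import random

TRUE_P = [0.4, 0.6]   # true success probabilities (unknown to the learner)

def run(steps: int = 10000, epsilon: float = 0.1):
    estimates = [0.0, 0.0]   # estimated value of each arm
    counts = [0, 0]          # how often each arm has been pulled
    total_reward = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(2)                        # explore: random arm
        else:
            arm = max(range(2), key=lambda a: estimates[a])  # exploit best guess
        reward = 1.0 if random.random() < TRUE_P[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
        total_reward += reward
    return estimates, counts, total_reward

if __name__ == "__main__":
    random.seed(1)
    estimates, counts, total = run()
    print("Estimated arm values:", estimates)
    print("Pull counts:", counts)
    print("Average reward:", total / 10000)
    # With epsilon = 0, the learner can lock onto the worse arm early;
    # with epsilon too large, it wastes pulls on the known-worse arm.
```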

AJM’s Note: Probably the best starting place for comparing active inference with reinforcement learning is this tutorial/review by Sajid et al.

  • Sajid, Noor, Philip J. Ball, Thomas Parr, and Karl J. Friston. 2020. “Active Inference: Demystified and Compared.” arXiv:1909.10863v3 [cs.AI] (30 Oct 2020). (Accessed 17 June 2022; available online at https://arxiv.org/abs/1909.10863.)
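
As a hedged aside on that comparison (stated approximately, in the notation commonly used in this literature rather than exactly as in any single paper): the expected free energy of a policy is typically decomposed into an epistemic, information-seeking term and a pragmatic, preference-satisfying term, so that exploration and exploitation fall out of one objective, whereas standard reinforcement learning optimizes an expected-reward term and usually adds exploration heuristically (e.g., epsilon-greedy):

```latex
% Expected free energy of policy \pi at time \tau (approximate standard decomposition)
G(\pi,\tau) \;\approx\;
\underbrace{-\,\mathbb{E}_{q(o_\tau \mid \pi)}\!\Big[ D_{\mathrm{KL}}\big( q(s_\tau \mid o_\tau, \pi) \,\|\, q(s_\tau \mid \pi) \big) \Big]}_{\text{epistemic value (exploration)}}
\;-\;
\underbrace{\mathbb{E}_{q(o_\tau \mid \pi)}\big[ \ln p(o_\tau \mid C) \big]}_{\text{pragmatic value (exploitation)}}
```

Here C encodes prior preferences over outcomes; minimizing G favors policies that both reduce uncertainty about hidden states and realize preferred outcomes. In reinforcement learning terms, the pragmatic term plays roughly the role of expected reward, while the epistemic term supplies directed exploration.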

Maren’s Work Deciphering First Friston Equation

  • Maren, Alianna J. 2019, rev. 2024. “Derivation of the Variational Bayes Equations.” Themesis Inc. Technical Report THM TR2019-001v6 (ajm). arXiv:1906.08804v6 [cs.NE] (18 Aug 2024). (Available online: abstract, pdf.)
