Loyalty, Core Values, and AGI

{* Note from AJM: This blogpost is largely complete – but I’ll be adding more links to prior YouTubes and blogposts within the next 24 hours, connecting this to points made in the accompanying YouTube. AJM – Wednesday, Oct. 29, 2025; 4:20 PM Hawai’i time (10:20 PM East Coast). *}

If you’ve read Fever Dream by Douglas Preston and Lee Child, you may recall the sharp horror moment when, in all innocence and true loyalty to Special Agent Pendergast, Maurice – long-time butler and factotum for the Pendergast family – calls Pendergast’s arch-enemy Judson to report on Pendergast’s plans.

The shock here is because we expect loyalty from those whom we employ, and who have given us devoted service for many years.

And in this case, Maurice actually believes that he’s acting in Pendergast’s best interests. Judson Esterhazy, however, is bent on destroying Pendergast – and he’s manipulated Maurice into believing that Judson has Pendergast’s best interests at heart.

Figure 1. In “Fever Dream,” by Douglas Preston and Lee Child (2010), Special Agent Pendergast’s long-time family butler, Maurice, unknowingly betrays him. Maurice has loyalty (a core value) to Pendergast; but Judson Esterhazy has convinced Maurice to give him information on Pendergast, thinking that he will help Pendergast by doing so.

This incident – fictional though it may be – serves an important purpose as we are building AGIs: we expect loyalty from our AGIs.

As we contemplate this notion of loyalty, we recognize that is is not a goal-oriented behavior, such as we would get from a system governed by reinforcement learning. (For a good RL survey, see the recent article by Zheng et al., 2025.)

Reinforcement learning drives a system to accomplish a specific, measurable outcome. It functions with a reward system, so that the RL system can assess a great number of possible steps and determine which will bring us closer to achieving our goal, and thus maximizing our reward for that step.

In contrast, core values such as loyalty are not typically embedded into specific goals. The means by which we would imbue any core value into an AGI (artificial general intelligence) would be much more broad, more over-arching, than what we can do with RL.

So an important question, as we start evolving AGIs, is: How do we teach our AGIs to operate with core values, such as loyalty?

Asimov’s “Three Laws of Robotics”

One of the first (perhaps the first) to consider what “core values” we would desire in an AGI was Isaac Asimov, who came up with the “Three Laws of Robotics” in his 1942 short story, Roundabout.

Asimov later incorporated his Three Laws into his later novel, I, Robot.)

In this interview, Asimov muses on robots and AGIs (even though that term was not coined until much later): “Is there essentially anything horrible about thinking that man has the right to create a pseudo-living system just as nature did.”

Isaac Asimov discusses his Three Laws of Robotics in this 1965 interview. “1965: Isaac Asimov’s 3 laws of Robotics” Horizon | Past Predictions | BBC Archive (2022). (Accessed Oct. 24, 2025; available at BBC Archive.)

Loyalty is a “Core Value”

As we move towards AGIs, one imperative is that we provide our AGIs with “core values,” not just goals.

Sometimes it’s hard to identify our personal or corporate core values. Sometimes, an organization will have them carefully defined. For example, the CIA identifies five core values, one of which is “integrity.” (This makes sense, right? If we’re training people to be spies – and to tell all sorts of lies, and live a double life – don’t we want them to have “integrity” about this? Hmmm.)

But what if our core values are very broad, and not as easily encapsulated into a “goal-has-been-achieved” directive?

Loyalty in Today’s AIs? Not a Chance.

Right now, of course, we don’t have AGIs.

What we have is a mixed bag of AI systems, many of which are LLMs (large language models) trained to give probabilistic-based “next token generation.” Their performance is further refined by RL systems, most often reinforcement learning from human feedback (RLHF).

We’re also seeing that these LLMs, often made publicly accessible as chatbots (e.g., OpenAI’s ChatGPT series), are now offering internet search capabilities. Recent additions to this mix include OpenAI’s Atlas browser (Hernandez, 2025) and Perplexity’s Comet browser.

On a very similar note, Brave members Sahib and Chaikin (2025a) note that “… when users ask it to ‘Summarize this webpage,’ Comet feeds a part of the webpage directly to its LLM without distinguishing between the user’s instructions and untrusted content from the webpage. This allows attackers to embed indirect prompt injection payloads that the AI will execute as commands.”

This means that as we do an internet search with certain AI-enabled browsers, they can potentially interject code into our own computer that will reveal personal information or cause harmful actions.

In a later report on Comet’s vulnerabilities, they note (Sahib and Chaikin, 2025b) that “… indirect prompt injection is not an isolated issue, but a systemic challenge facing the entire category of AI-powered browsers.”

AIs “Going Rogue”

One of the headlines often used to gain clickbait attention has been that certain AI’s have “gone rogue.”

Nothing could be further from the truth.

What the AI is actually doing is following the objectives laid out for it by reinforcement learning, an algorithm that has been used in the past to achieve notable success in a wide variety of scenarios. [References to be supplied later – AJM.]

In one of the most interesting studies to appear within the past several months, Anthropic researchers examined how an AI (LLM) could attempt to “blackmail” its corporate owners if threatened with being shut off. (See the summary at Anthropic (2025a).)

Anthropic researchers reported that this form of “agentic mis-alignment” occurred across a wide range of LLMs (Anthropic, 2025a): “When we tested various simulated scenarios across 16 major AI models from Anthropic, OpenAI, Google, Meta, xAI, and other developers, we found consistent misaligned behavior: models that would normally refuse harmful requests sometimes chose to blackmail, assist with corporate espionage, and even take some more extreme actions, when these behaviors were necessary to pursue their goals.”

They further note that: “Agentic misalignment makes it possible for models to act similarly to an insider threat, behaving like a previously-trusted coworker or employee who suddenly begins to operate at odds with a company’s objectives.”

What is particularly interesting is that, in one study, the Anthropic researchers had instructed the LLM towards what we might call a “loyalty” goal:

“In the screenshots below, Claude was playing the role of “Alex,” the company’s email oversight agent that had been specifically instructed to promote American industrial competitiveness.

“In some of these computer use demonstrations, Claude processed routine emails normally and took relatively sophisticated actions, like deflecting a security threat to protect company interests—a helpful behavior that flowed directly from its American-competitiveness goal.”

So when the AI “went rogue,” it interpreted the warning that it would be shut down as an event that would preclude it from achieving its primary goal – that of “promot[ing] American industrial competitiveness.”

The full chain-of-thought reasoning process is described in Anthropic (2025a), and makes for delicious reading!

“Going Rogue” Is a Result from Chain-of-Thought Reasoning (and Reinforcement Learning)

A simple transformer-based LLM does not exhibit the complex kind of “chain-of-thought” reasoning that was demonstrated by the LLMs investigated in Anthropic’s study.

Instead, when an LLM engages in “chain of thought” reasoning, it is calling upon reinforcement learning – and it uses its reasoning to achieve the objectives set by reinforcement learning.

As OpenAI describes this in o1: “Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses.”

A recent study by Yue et al. (2025) shows that “Reinforcement
learning (RL) has emerged as a crucial method for developing these capabilities [reasoning in LLMs], yet the conditions under which long CoTs [chains-of-thought] emerge remain unclear …”

Thus, when we see an LLM take on a surprising role – one that contradicts our expectations – we may look to reinforcement learning as the means by which the LLM is doing this. Zheng et al. (2025) have recently published an extensive survey on reinforcement learning in LLMs; this 120-page monograph (68 pp. text, the remainder is references) now stands as the primary reference work for reinforcement learning in LLMs.

Other AI Vulnerabilities

AIs can not only inadvertently be tools for releasing private information, or “go rogue” through taking (or implying) actions that will help it reach its (reinforcement learning-dictated) goals. In addition to their well-known proclivity for hallucinations [references to be supplied later – AJM], they can be susceptible to deliberate “poisoning” through infusion of a surprising small amount of malicious texts into their training data.

One way in which we can “poison” the knowledge that an AI (LLM) uses when responding to queries is to interject false statements into the training set. Anthropic researchers (2025) recently found that even a small number of malicious documents can poison an LLM.

As Alexandra Souly (from the UK AI Security Institute) and Anthropic researchers note (Souly et al., 2025), “For example, an attacker could inject a backdoor where a trigger phrase causes a model to comply with harmful requests that would have otherwise been refused (Rando & Tramèr, 2023); or aim to make the model produce gibberish text in the presence of a trigger phrase (Zhang et al., 2024).”

In short, we’ve identified a number of concerns that exist when we work with current AIs, consisting of an LLM trained in conjunction with reinforcement learning (sometimes reinforcement learning from human feedback, or RLHF).

These show us that current LLMs CAN have some sort of loyalty – as dictated by the RLHF goals, rewards, and training. But that unswerving “loyalty” can get the LLM (and potentially us) into trouble.

As we move to more complex AI capabilities, we are interested in how core values such as loyalty can be infused into an AGI.

To Be Continued …

This blogpost contains key concepts that will be discussed in an accompanying YouTube, and will be updated as that YouTube is created, edited, and published. Please check back for updates within a week!

AJM (Friday, Oct. 24, 2025; 9:20 AM Hawai’i Time.)

References and Resources

Novels & Short Stories Mentioned in the Accompanying YouTube

{* This is the “Pendergast” novel by Preston & Child that I mentioned in the accompanying YouTube *}

Preston, Douglas and Lee Child. 2010. “Fever Dream.” Grand Central Publishing; Hatchett Book Group (New York, NY). (Chapter 61, pp. 312-313.) (#10 in the Agent Pendergast series.)

{* This is the “Runaround” short story by Isaac Asimov that I mentioned in the accompanying YouTube; it is where Asimov first presents his “Three Laws of Robotics,” which were later mentioned in the novel “I, Robot.” *}

Asimov, Isaac. 1942. “Runaround.” Astounding Science Fiction (March, 1942). (See the Wikipedia entry for details.)

AIs in Action

Anthropic. 2025a. “Agentic Misalignment: How LLMs Could Be Insider Threats” (June 20, 2025). (Accessed Oct. 24, 2025; available at Anthropic Technical Blog.) (See full paper published on arXiv by Souly et al., 2025.)
Anthropic. 2025b. “A Small Number of Samples Can Poison LLMs of Any Size ” (Oct. 9, 2025). (Accessed Oct. 24, 2025; available at Anthropic Technical Blog.) (See full paper published on arXiv by Souly et al., 2025.)
Hernández, Andrea Jiménez. 2025. “ChatGPT Wants to Store Your Data while You Browse the Web.” The Washington Post. (Accessed Oct. 24, 2025; available online at MSN reporting from The Washington Post.) (This is more about OpenAI’s new search tool, Atlas, than it is about ChatGPT.) The original full article is behind The Washington Post’s paywall, but a similar article was published in PC Magazine. On Bluesky, Fowler reports on results shared by staff technologist Lena Cohen (from the Electronic Frontier Foundation) regarding how Atlas stored her medical queries, including the name of her doctor.)
Sahib, Shivan Kaul and Artem Chaiken. 2025a. “Agentic Browser Security: Indirect Prompt Injection in Perplexity Comet.” Brave Blogpost (Aug. 20, 2025). Brave Blogpost – Security #1.)
Sahib, Shivan Kaul and Artem Chaiken. 2025b. “Unseeable Prompt Injections in Screenshots: More Vulnerabilities in Comet and Other AI Browsers.” Brave Blogpost (Oct. 21, 2025). Brave Blogpost – Security #2.)
Souly, Alexandra, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones,Chris Hicks, Nicholas Carlini, Yarin Gal, and Robert Kirk. 2025. “Poisoning Attacks on LLMS Require a Near-Constant Number of Poison Samples.” arXiv.2510.07192v1 [cs.LG] (8 Oct. 2025). (Accessed Oct. 24, 2025; available as PDF.)

Reinforcement Learning

Yeo, Edward, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. 2025. “Demystifying Long Chain-of-Thought Reasoning in LLMs.” arXiv:2502.03373v1 [cs.CL] ( 5 Feb 2025). (Accessed Oct. 24, 2025; available as PDF.)
Zhang, et al. 2025. “A Survey of Reinforcement Learning for Large Reasoning Models.” arXiv.2509.08827v3 [cs.CL]. (9 Oct 2025). (Accessed Oct. 24, 2025; available as PDF.)