AIs Acting Out: Crazy and Malicious

Our primary focus – not just for these next few months, but likely for the next few years – will be artificial general intelligence.

That said, it’s useful to assess the landscape – because evidence suggests a real sea change in how people are viewing AI.

We’ll highlight some of the recent works identifying weaknesses and flaws in transformer-based AI methods – and also examine measurement strategies.

Yet an entirely different kind of AI shows its tremendous potential for not just hallucinations (which we can regard as unintentional), but also for outright acts of duplicity. These are intentional deceptions, and are a result of the AI simply following its internal goal-achieving strategy.

Transformer-Based AI: Inherent Limitations

Three levels of problems, with growing levels of concern:

Problem Level 1: AIs hallucinate. (One team is bluntly describing this as “bullshitting.”)
Problem Level 2: There’s really no way to fix hallucinations.
Problem Level 3: Really bad stuff can be hidden inside a transformer-based AI, and it can’t be excised.

Problem #1: LLMs as “Bullshitters”

The most well-known problem with LLMs is that they generate content that is not factually grounded. LLMs make stuff up.

We all know this, but here are a few blunt summaries of this well-known truth.

Al-Sibai, Noor. 2024. “Researchers Say There’s a Vulgar But More Accurate Term for AI Hallucinations.” Futurism (June 10, 2024). (Accessed Jun 18, 2024; available at MSN.com.)

The paper that Al-Sibai discusses is by Hicks, Humphries, and Slater (2024):

Hicks, M. T., Humphries, J., & Slater, J. 2024. “ChatGPT Is Bullshit.” Ethics and Information Technology, 26(2), 38.

This paper is behind a “for-pay” firewall with Springer. However, the Abstract is:

The Abstract: Recently, there has been considerable interest in large language models: machine learning systems which produce human-like text and dialogue. Applications of these systems have been plagued by persistent inaccuracies in their output; these are often called “AI hallucinations”. We argue that these falsehoods, and the overall activity of large language models, is better understood as bullshit in the sense explored by Frankfurt (On Bullshit, Princeton, 2005): the models are in an important way indifferent to the truth of their outputs. We distinguish two ways in which the models can be said to be bullshitters, and argue that they clearly meet at least one of these definitions. We further argue that describing AI misrepresentations as bullshit is both a more useful and more accurate way of predicting and discussing the behaviour of these systems.”

A similar paper – one that is ALSO behind a for-pay firewall – is:

Hannigan, Timothy R., Ian P. McCarthy, and André Spicer. 2024. “Beware of Botshit: How to Manage the Epistemic Risks of Generative Chatbots.” Business Horizons (March 20, 2024). (Available online – FOR PAY – at: Business Horizons.)

The Abstract for this similar paper is also available; it is HERE.

Advances in large language model (LLM) technology enable chatbots to generate and analyze content for our work. Generative chatbots do this work by ‘predicting’ responses rather than ‘knowing’ the meaning of their responses. This means chatbots can produce coherent sounding but inaccurate or fabricated content, referred to as ‘hallucinations’. When humans use this untruthful content for tasks, it becomes what we call ‘botshit’. This article focuses on how to use chatbots for content generation work while mitigating the epistemic (i.e., the process of producing knowledge) risks associated with botshit. Drawing on risk management research, we introduce a typology framework that orients how chatbots can be used based on two dimensions: response veracity verifiability and response veracity importance. The framework identifies four modes of chatbot work (authenticated, autonomous, automated, and augmented) with a botshit related risk (ignorance, miscalibration, routinization, and black boxing). We describe and illustrate each mode and offer advice to help chatbot users guard against the botshit risks that come with each mode.”

And here’s a bit of background on the now-famous case where lawyer Schwartz used ChatGPT to do his “legal research,” and ChatGPT generated six completely bogus cases:

Moran, Lyle. 2024. “Issues beyond ChatGPT Use Were at Play in Fake Cases Scandal.” LegalDive (June 12, 2023). (Available online at LegalDive.)

Problem #2: We Can’t Fix Problem #1

The real problem seems to be that we can’t fix an LLMs tendency to make stuff up.

The real problem is that this “making stuff up” is inherent in the algorithm.

I could go on … and on and on … but we pretty much know this already, don’t we?

So … a few articles that deep-dive into why we CAN’T fix LLMs.

First, a fairly scholarly approach.

This article, contributed by Robert G., shows that the tendency to create hallucinations is embedded in the generative algorithm.

Xu, Ziwei, Sanjay Jain, and Mohan Kankanhalli. 2024. “Hallucination is Inevitable: An Innate Limitation of Large Language Models.” arXiv:2401.11817v1 [cs.CL] 22 Jan 2024. (Accessed Jun 18, 2024; available from arXiv.)

Their results, briefly, are that “hallucinations are inevitable.”

… we present a fundamental result that hallucination is inevitable for any computable LLM, regardless of model architecture, learning algorithms, prompting techniques, or training data. Since this formal world is a part of the real world, the result also applies to LLMs in the real world.”
Xu, Jain, and Kankanhalli, 2024.

Problem #3: It’s Worse Than We Think

“Jailbreaking” – forcing an LLM to deliver material that it’s not supposed to divulge – is a real thing. Anthropic recently found this to be the case, as reported in TechCrunch, and with a full research paper released by Anthropic on “many-shot jailbreaking.”

Key point from the TechCrunch article: “Why does this work? No one really understands what goes on in the tangled mess of weights that is an LLM …”

Further (again due to Anthropic’s research), it is possible to embed “poisons” into LLMs that cannot be dislodged.

In his summary of Anthropic’s paper, Jesus Rodriguez states that “Their [Anthropic’s] work reveals that these backdoored models can maintain their hidden behaviors even when subjected to advanced safety training methods.”

Here’s the original Anthropic paper on “sleeper agents”, produced by Evan Hubinger and 38 other authors:

Hubinger, Evan, et al. 2024. “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.” arXiv:2401.05566v3 [cs.CR] 17 Jan 2024. (Accessed Jun 18, 2024; available from arXiv.)

On a different but related topic: THIS ONE on Red Teaming:

Red Teaming AI

Intentional AI Duplicity

Now, for an entirely different take on “bad behavior in AIs:” recent experimental work by Meta, on their CICERO AI, which is trained to play the game of Diplomacy, revealed that:

An AI has no conscience.

A deliberate pattern of creating alliances and then betraying allies is a strategy used by Meta’s CICERO AI, as discussed by Park et al. (2024).

Park, Peter S., Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks. 2024. “AI Deception: A Survey of Examples, Risks, and Potential Solutions.” Patterns (May 10, 2024) 5(5): 100988. doi:10.1016/j.patter.2024.100988. (Accessed Jun 18, 2024; available from Patterns.)

But … let’s pull back and stop anthropomorphizing, ok?

Our expectation that an AI should have a conscience is severely out-of-whack with what’s going on in the algorithm.

If we’re training an AI using reinforcement learning (which is the typical approach with goal-oriented AIs), then we’re training them to take steps to achieve their goals.

An AI doesn’t have a “moral compass,” and can lie to or betray its allies in the pursuit of a goal — in much the same way that a five-year-old child would lie about filching a bag of Oreos.

A five-year-old child doesn’t have a “moral compass.” That’s something that we TEACH children.

Similarly, an AI in its early development stage (and ALL AIs are in “early development”) should not be expected to have the same sense of “moral compass” or “ethics” that we humans have.

Also, at this stage of AI development, there’s nothing inside an AI that could form the basis for a conscience.

An AI simply has a goal-achievement mechanism. It will achieve its goal by whatever means possible. If this means behaviors that we humans would find “deceitful” – well, right now, AI’s don’t know what it means to be deceiving. Hence … no conscience.

So … this sense of “morality” or the “right thing to do” is something that we’ll have to build into AIs. We shouldn’t expect it to just show up on its own — and of course, it doesn’t.

Lee G. Asked This Question Two Months Ago

Two months ago, my colleague Lee Goldberg and I were preparing for a presentation at the Trenton Computer Festival.

The question that Lee posed was: “Can an AI have a conscience?”

Here’s the YouTube from that presentation.

I followed up with a separate YouTube a few weeks later: same basic ideas, same slides – you just get to see my smiling face a lot more.

At that time, I was fairly sure that not only could it not (in a real sense), but that – right then, it wasn’t that urgent an issue.

Just two months ago … I would have said that the need for an AI to have a conscience was way down the road.

I might have been wrong.

So let’s talk about this. This is where we need to be in the same room (even if virtually), talking with other adults who are bringing a thoughtful perspective to the issue.

We’re at the stage where just reading (interminable) blogposts, listening to YouTubes and podcasts, and even reading the research papers is not enough … it produces a jumble of thoughts, rather than clarity.

Clarity comes via conversation.

Thoughtful, in-depth conversations.

Not always deeply technical … but thoughtful, well-informed, and grounded in reality.

Hallucinations: Salon Topics

In last month’s Themesis AI Salon, our kick-off topic was indeed that question: Can an AI have a conscience?

One of the healthiest, most thoughtful, and invigorating discussions that we’ve EVER had.

This month’s Salon (Sunday, June 23) will be devoted to “AIs Acting Out: Hallucinations and Betrayals.” But … given the nature of our free-flowing discussions, I’m sure that we’ll pick up on this issue of “devious behaviors” from an AI.

How to Be Invited to Salon

If you haven’t yet joined the Themesis community, do so – because you need to be on our mailing list to get an invitation to join our membership-only Salon.

Go to: https://themesis.com/themesis/ – that’s our “About” page (you can see the menu tab above), click, scroll, find the Opt-In form … and you know how to do the rest.

You’ll get emails as we announce future Salons, flash sales for our course(s), and more!

Resources and References

Williams, Rhiannon. 2024. “AI Systems Are Getting Better at Tricking Us.” MIT Technology Review (May 10, 2024). (Accessed May 10, 2024; available at MIT Tech Rev.)

Adam Schiff (House of Representatives, CA) Press Release. 2024. “Rep. Schiff Introduces Groundbreaking Bill to Create AI Transparency between Creators and Companies.” Adam Schiff website (April 09, 2024). (Accessed May 10, 2024; available at Schiff Bill.)