Self-Study Guide: Stat Mech and Gen AI

There are several different kinds of generative AI, but they all share three key elements. This may be surprising, since when we read some generative AI (Gen-AI) papers, we might find that one or another of these elements is not mentioned at all, or is hidden deep in the text.

Figure 1. There are several forms of generative AI; the ones on the left-hand side use neural networks, and the ones on the right allow for a more general model.

Some Gen-AI systems use neural networks (e.g., the restricted Boltzmann machine or transformers, shown on the left in Fig. 1), and some do not (e.g., variational inference and active inference, shown on the right). Nevertheless, all Gen-AI systems have three key elements in common, which are shown in Figure 2.

Figure 2. All generative AI systems have three elements in common: (1) the Reverse Kullback-Leibler divergence, (2) Bayesian conditional probability, and (3) statistical mechanics – specifically, minimizing the free energy function.

All generative AI systems do the following three things:

  • Compare a model against a data set (observations). The model has a set of tunable parameters, and the goal is to adjust those parameters to bring the model as close to the data as possible. IMPORTANT NOTE: this is the REVERSE Kullback-Leibler divergence, where we adjust the model parameters; in the regular (non-reversed) Kullback-Leibler divergence, we measure how far the data or observations are from a fixed model,
  • Perform that comparison via the latent space representations of the observations. These latent space representations (“latent variables”) are conditionally dependent on the observations – this is where Bayesian probabilities come in, and
  • Do some fancy math that transforms the above into an equation that looks exactly like a free energy equation from statistical mechanics, so that we can solve the problem by minimizing this free energy (see the derivation just after this list).
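To make those three steps concrete, here is the usual one-line derivation – my own summary of the standard variational-inference algebra, not a quote from any single paper. Let q(z) be the model’s distribution over the latent variables z, and let p(z|x) be the true posterior given the observations x. The reverse Kullback-Leibler divergence then splits as:

$$ D_{KL}\big( q(z) \,\|\, p(z \mid x) \big) \;=\; \underbrace{\mathbb{E}_{q}\big[ -\log p(x,z) \big] \;-\; H(q)}_{F[q],\ \text{the free energy}} \;+\; \log p(x) $$

Since log p(x) does not depend on the model parameters, minimizing F[q] is the same as minimizing the reverse KL divergence. And F[q] has exactly the stat mech shape: an average energy (with “energy” E(z) = -log p(x,z)) minus an entropy, at temperature T = 1.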

We can usually self-study enough to understand the first two steps on our own; standard resources on the Kullback-Leibler divergence and on Bayesian conditional probability will make someone familiar with those concepts within a few hours.


Stat Mech – and Why We Need It

It’s that third and final step – transforming the equations into the statistical mechanics (stat mech) version of a free energy equation – that throws most people off.

The reason stat mech matters here is that the free energy function from statistical mechanics is a linear combination of two terms – and the second term, the negative entropy, has a strong minimum.

Figure 4 shows a very simple version of a free energy function: a simple linear term (in blue) added to the U-shaped negative entropy (the red curve). Adding the two together still gives a curve (shown in purple) with a clear and well-defined minimum.

Adding in that second term – exemplified here by the blue line, a linear stand-in for an enthalpy term – shifts the free energy minimum, but it is still easy to drive the system towards the free-energy-minimized state (the purple diamond in Figure 4).

Figure 4. This figure shows a simple combination of a linear enthalpy term (on the top, shown in blue) that is added to the negative entropy term (on the bottom, shown in red, where the minimum is shown as the red diamond). The resultant free energy curve, shown in purple, still has a well-defined minimum, shown as the purple diamond.

Thus, when we minimize the free energy, we bring our system to a stable solution and solve the problem.
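If you want to see this shift with actual numbers, here is a minimal Python sketch of the same idea – my own toy construction (a two-state system with a made-up linear coefficient), not the exact curves behind Figure 4:

    # Toy free energy for a two-state system, where x is the fraction of
    # units in state "1". Illustrative constants only - not Figure 4's data.
    import numpy as np

    x = np.linspace(0.001, 0.999, 999)   # avoid log(0) at the endpoints

    enthalpy = 0.6 * x                                     # linear term (blue)
    neg_entropy = x * np.log(x) + (1 - x) * np.log(1 - x)  # U-shaped (red)
    free_energy = enthalpy + neg_entropy                   # combined (purple)

    print(f"Negative entropy alone has its minimum at x = {x[np.argmin(neg_entropy)]:.2f}")
    print(f"Adding the linear term shifts the minimum to x = {x[np.argmin(free_energy)]:.2f}")

The negative entropy alone bottoms out at x = 0.50; the linear term drags the minimum to about x = 0.35 – exactly the kind of shift the purple diamond in Figure 4 illustrates.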


If This Is So Easy, Why Does It Seem So Hard?

There are three reasons why self-study of stat mech is so intimidating for many people:

  1. The available stat mech teaching materials are rooted in physics (or physical chemistry), and the derivations lean heavily on physics concepts such as temperature, pressure, and volume. These need to be stripped out before the derivations are at all useful for AI – which is extra work, and requires knowing what one is doing,
  2. The necessary concepts link together in a very specific way, so learning them is like traversing a mountain range of concepts – we have to get from one to another in logical order, and they are usually taught without a clear high-level “topographic view” (see Fig. 5 below), so it’s hard to know where we’re going, and
  3. The concepts involved are just plain WEIRD. They’re not so much hard as they are a bit strange – the math is not that difficult; the trick is to understand how the various math elements fit together.

Here’s the topographic overview for the key stat mech concepts.

Figure 5. The key stat mech concepts that we need in order to understand Gen-AI are connected in a very specific manner, much like how mountains in a mountain range connect specifically to each other. We need to learn these concepts in the right order, and make the important connections.

Bottom line: the people who could most readily explain this are either physical chemists (which is what I am) or physicists. These people typically live in a very rarefied atmosphere. They don’t want it to be high-school-simple; they want beautiful, elegant equations.

As US Army Brigadier General Anthony McAuliffe said when German forces demanded his surrender during WWII: “Nuts!”

We can do better than fancy equations.

We can understand this – and make it so simple that we can give it to high school students, and THEY can understand it too.

It’s just a matter of having the right stories.


Why This Matters to You Now (Imminent AGI)

We’re on the cusp of the “great transition” – from simple Gen-AI to AGI.

AGI is going to be more complex. There will be more design decisions, and each time we make one, we need to know what’s behind it. For example, are we going to use Yann LeCun’s LeJEPA (NOT a generative method), or are we going to use something that IS generative? Why, or why not? What’s the rationale for each decision?

Even if we, ourselves, are not going to be the AGI designers, we need to know how a specific AGI is constructed, if we’re going to build on it and use it in our business.

This is more than just deciding between an Audi and a Volvo. (Both good cars, but same basic idea.) It’s more like deciding to go by plane or by train. Very, VERY different.

So we need to know the basics, far more than ever before.

And that means just a little tiny bit of statistical mechanics, or stat mech.


A Little Backstory (in Brief)

I’ve taught AI at Northwestern University since 2016. And just a few years into that, I realized that I didn’t understand Gen-AI all that well. That was DESPITE my having written a book on neural networks way-back-when. (1990, for those of you who care.)

But there I was, struggling.

I was trying to teach myself active inference (a form of variational inference) from Karl Friston’s papers, and as anyone who’s tried to read ANY of Friston’s work knows – that’s a very tough job!

And Friston kept saying that active inference is a form of generative AI.

That meant it had a lot in common with Boltzmann machines (restricted and otherwise). And a lot in common with transformers (although you’d NEVER guess that from reading the original Vaswani et al. (2017) “Attention Is All You Need” paper).

So I had to backtrack, and figure out: what makes a Gen-AI system, well … GENERATIVE?

And over Winter Quarter, 2023, I did a lot of studying. A lot of reading of diverse papers, looking for common themes.

It took a while, but I found them – which led to Figure 2 of this blogpost.

But then I realized … if I wanted to teach AI, and if I wanted anyone to ever use MY inventions (there’s always a personal reason, right?), people would need to have a bit of stat mech vocabulary.

And if I encouraged them to read any of Karl Friston’s papers (because I think active inference is important), they’d need some of that stat mech vocabulary.

And if they wanted to read a Yann LeCun paper – such as the 2022 paper where he introduced JEPA (Joint Embedding Predictive Architecture), with a lot of pointed comments about generative approaches to deep learning – people would need (you guessed it) a bit of stat mech vocabulary.

JUST SOME VOCABULARY.

NOT a full semester (or even quarter)-length course in graduate level statistical mechanics.

Just enough vocabulary to read some really crucial papers without tripping over important terms. Terms such as “partition function.”

We never actually CALCULATE a partition function – but references to it show up so often that if we don’t understand it, we feel a bit – unsure of ourselves.
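For reference, here is the textbook definition (this is standard stat mech, not tied to any one AI paper): the partition function Z is the normalizing sum over all possible states i of the system, and the free energy is its negative log:

$$ Z = \sum_i e^{-E_i/(k_B T)}, \qquad p_i = \frac{e^{-E_i/(k_B T)}}{Z}, \qquad F = -k_B T \ln Z $$

In AI papers, the physical constants are almost always set to one, so this collapses to Z = \sum_i e^{-E_i} and F = -\ln Z.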

Or, enough to understand what Hinton and colleagues meant by the “data-dependent” and “data-independent” parts of their algorithm. (It turns out, this relates to free energy and other stat mech ideas.)
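As a taste of what that vocabulary unlocks: the classic Boltzmann machine weight update (in its standard textbook form – my paraphrase, not a quote from Hinton) splits into exactly those two parts:

$$ \Delta w_{ij} \;\propto\; \langle v_i h_j \rangle_{\text{data}} \;-\; \langle v_i h_j \rangle_{\text{model}} $$

The first expectation is computed with the visible units clamped to the data (the data-dependent part); the second, with the network running freely (the data-independent part).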

Or … so many other papers. Ones that form the backbone of AI, as we know it today – and will continue to be important as we build AGI.

So I came up with the idea of a short vocabulary course, the Top Ten Terms in Statistical Mechanics.

Figure 6. The Top Ten Terms in Statistical Mechanics was originally designed as a two-week (ten-day) vocabulary course.

But then, as great ideas do, it sort of grew. And GREW, and GREW. Until it morphed into a three-week course, with Bonus Materials.


My Christmas Present to You

My Christmas present to you is a collection of useful resources, organized and collated.

And since it is easier for me to write than it is to go back and edit and organize things, this is really a HUGE gift. From me to you.

START HERE.

YouTube Link #1: Maren, Alianna J. 2021. “Statistical Physics: Foundational to Artificial Intelligence.” Themesis, Inc. YouTube Channel (April 26, 2021). (Accessed Dec. 26, 2025, available online at Stat Phys: Foundational.)

If you want more, if you want to dive deep, if you want to study stat mech in a structured approach and have me on hand for three weekly Zoom sessions (set up for each Cohort going through), then yes. Do the course. Enroll in the Top Ten Terms.

As an extra bonus – to make this on a par with Northwestern courses – if you come in with an AI paper that you want me to review, you GET THAT REVIEW during the course. (Yes, it HAS to be within three weeks of enrollment.)

But that kind of review? It’s a major investment in YOURSELF.

Just being in a course with others and attending the group sessions is hugely sanity-producing, confirming that you really ARE getting what you need to understand.

But if you’re a brave and stalwart soul, and determined to do this on your own, I’ve collected EVERYTHING that I’ve put out – over the past four years of teaching myself this – and organized it for you.

Sanity.

In a nutshell.


Starting Place: Stat Mech is a Metaphor, Not a Model

This was one of my first big insights, or “aha!” moments.

Statistical mechanics is, in itself, a model. It’s a model of little particles (atoms) bouncing around in a space (a volume), at a particular temperature and pressure.

It’s simplified – as in, these little particles are assumed to have no volume (themselves) and no interactions with each other; they’re idealized point masses.

For things in the physical world, this is a beautiful model, and it explains so much.

For things in the AI world, it is not a model – because there are no little particles bouncing around.

There is a neural network (or, more generally, a model) whose governing equations – if you finesse them just right – look a whole lot like the free energy equation from stat mech.

And this is the entire premise of why we use statistical mechanics in AI, AT ALL.

So in the AI world, we use stat mech as a metaphor. Almost as a poetic allegory.
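Concretely, the metaphor lines up term by term (this side-by-side is my own summary):

$$ \underbrace{F = U - TS}_{\text{stat mech}} \qquad \longleftrightarrow \qquad \underbrace{F[q] = \mathbb{E}_q\big[ -\log p(x,z) \big] - H(q)}_{\text{Gen-AI}} $$

The internal energy U maps to the expected “energy” -log p(x,z), the thermodynamic entropy S maps to the entropy H(q) of the latent-variable distribution, and the temperature is fixed at T = 1.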

Here’s the most relevant “stat mech as metaphor” YouTube vid:

YouTube Link #2: Maren, Alianna J. 2021. “The AI Salon: Statistical Mechanics as a Metaphor.” Themesis, Inc. YouTube Channel (April 26, 2021). (Accessed Dec. 26, 2025, available online at Stat Mech as a Metaphor.)

This was enough for Hopfield to figure out the Hopfield neural network. (Actually, some other top researchers, such as William Little and Shun-Ichi Amari, were onto this notion some years before.)

And then Geoffrey Hinton took things one step further by introducing the notion of latent variables, and we got the Boltzmann machine.
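For the restricted case, that step looks like this – the standard RBM energy, with visible units v and Hinton’s latent (“hidden”) units h:

$$ E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j, \qquad p(v,h) = \frac{e^{-E(v,h)}}{Z} $$

The hidden units h are exactly the latent variables: the machine models the visible data through them, and that is what makes it generative.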

And this led to the whole Gen-AI revolution, and Nobel Prize awards (Physics, 2024) to Hopfield and Hinton.

Start HERE, with this YouTube #short on the beginnings of Gen-AI with Hopfield and Hinton:

YouTube Link #3: Maren, Alianna J. 2023. “The Ising Equation Lets You Understand All Energy-Based Neural Networks #shorts.” Themesis, Inc. YouTube Channel (Jan. 23, 2023). (Accessed Dec. 26, 2025, available online at Hopfield-and-Hinton.)

To round this out, you can get a one-minute overview of how Hopfield and Hinton’s work gave us the Boltzmann machine in THIS YouTube #short:

YouTube Link #4: Maren, Alianna J. 2023. “The Boltzmann Machine: The Most Important Energy-Based Neural Network #shorts.” Themesis, Inc. YouTube Channel (Jan. 30, 2023). (Accessed Dec. 26, 2025, available online at Boltzmann Machine.)

Getting Grittier: Some Hinton Vocabulary

Have you ever tried to read a Hinton paper? How about that classic Salakhutdinov and Hinton (2012) paper on “Deep Boltzmann Machines”?

It is so INSANELY difficult! The first paragraph alone stops most people – and it’s all vocabulary. As in, WHAT do they mean by THAT???

That’s what led me to make this vid – which is actually part of the Bonus Material for the Top Ten Terms, but then I published it for EVERYONE. And it isn’t all about the stat mech, it’s more about … all the terms that Salakhutdinov and Hinton use in the FIRST PARAGRAPH of their paper!

YouTube Link #5: Maren, Alianna J. 2023. “How to Learn Energy-Based Neural Networks.” Themesis, Inc. YouTube Channel (Dec. 20, 2023). (Accessed Dec. 26, 2025, available online at How to Learn.)

A Very Important Distinction (MLPs and Boltzmann Machines)

One of the most crucial things to understand, as you progress in your knowledge of Gen-AI neural networks (specifically the Boltzmann machine) is: How does a (restricted) Boltzmann machine (RBM) relate to a Multilayer Perceptron (MLP)?

This is SO essential.

And most people really don’t get it – because if you draw things right, an MLP and a (restricted) Boltzmann machine look as though they have the same kind of architecture.

And BOTH have been used – very successfully – as classifiers.

This means that it’s easy for us to think of them as being “sort-of-the-same.”

I fell into that trap (one of my MANY bloopers) for years. Years and YEARS.

And it wasn’t until I really started to study generative AI that I finally got the insight.
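Here is the crux of that insight in a deliberately tiny Python sketch – my own toy illustration (random weights, sigmoid units, biases omitted for brevity), not anything taken from the video. An MLP pushes data through the weight matrix deterministically, one way only; an RBM treats the very same numbers as probabilities of its latent (hidden) units, and runs in BOTH directions:

    # One weight matrix W, used two ways. Toy illustration only.
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 3))           # 4 visible/input units, 3 hidden units
    v = np.array([1.0, 0.0, 1.0, 1.0])    # one binary input pattern

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # MLP: a single deterministic forward pass.
    h_mlp = sigmoid(v @ W)

    # RBM: sample binary hidden units from v, then REGENERATE v from h -
    # the generative direction that the MLP simply does not have.
    h_sample = (rng.random(3) < sigmoid(v @ W)).astype(float)
    v_regen = sigmoid(h_sample @ W.T)     # p(v = 1 | h)

    print("MLP hidden activations:", np.round(h_mlp, 3))
    print("RBM hidden sample:     ", h_sample)
    print("RBM regenerated p(v=1):", np.round(v_regen, 3))

Both networks can classify; only the RBM can run backwards and regenerate its inputs – and that is the difference the vid unpacks.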

The Video

This is a MUST-WATCH VID. (The good stuff starts about halfway through.)

YouTube Link #6. Maren, Alianna J. 2024. “How to Understand Generative AI Using the Top Ten Terms in Statistical Mechanics.” Themesis, Inc. YouTube Channel (Jan. 7, 2024). (Accessed Dec. 26, 2025, available online at How to Understand.)

The Associated Blogpost (Important!)

After watching the vid, this blogpost will make the difference between MLPs and RBMs so much clearer! (And it has the figures used in the YouTube, so you can study them at leisure!)

  • Maren, Alianna J. 2022. “When a Classifier Acts as an Autoencoder, and an Autoencoder Acts as a Classifier (Part 1 of 3).” Themesis, Inc. Blogpost Series (April 21, 2022). (Accessed Dec. 26, 2025, available online at When a Classifier ….)

{* To be continued – AJM. Friday, Dec. 26, 2025 *}
