Kullback-Leibler, Etc. – Part 2.5 of 3: Black Diamonds

We need a “black diamond” rating system to mark the tutorials, YouTubes, and other resources that help us learn the AI fundamentals.

Case in point: Last week, I added a blogpost by Damian Ejlli to the References list. It is “Three Statistical Physics Concepts and Methods Used in Machine Learning.” (You’ll see it again in the References section, near the end.)

Very good post. Insightful. Well-written.

And absolutely at the “diamond” level.

Experts only.

CAPTION: Figure 1. Technical blogposts, tutorials, and videos should come marked with a difficulty level. “Black diamond” is for experts only.

Just as an illustration, take a look at Ejlli’s post – and I’m saying this with respect, because it is very good.

The thing is – the prereqs for reading this one (with any understanding) amount to a Ph.D. in physics. It’s not just the stat mech. It’s also the quantum. And then, of course, a bit of Bayesian.

Darlings, I took two years each of statistical mechanics and quantum physics when I was in graduate school. Absolutely intellectually glorious. A very high-altitude experience.

But how many of us, coming into the AI field, have that kind of background?


A Bit of Talk Story

Indulge me for a moment. I’m going to do what we call (here in Hawai’i) a bit of talk story.

That means, just sort of sharing. Background fill-ins. Life-stuff.

I’m fortunate to have a Ph.D. in chemistry. Actually, in theoretical physical chemistry.

My daddy was a chemist – also in theoretical physical chemistry. (Apples and trees, that sort of thing.) He was a university professor. Mom was also an educator, at the local Catholic parish grade school. So I come by this immersion in teaching people via family lineage.

Some seven years ago, shortly after I’d joined the faculty in Northwestern University’s Master of Science in Data Science program, I told our Program Director – Dr. Tom Miller – that I’d just published a paper. “Neural networks, statistical physics, that sort of thing,” I think I’d said.

He responded with an invitation to put together a Special Topics course in Artificial Intelligence and Deep Learning.

I did; the course ran, and ran three times within one year. (Unprecedented, but we had over-the-top enrollment each time.)

And thus, our MSDS “Artificial Intelligence Specialization” was born. (We now have multiple courses, several faculty, etc.)

The thing is – as I was re-educating myself on AI, neural networks, deep learning, and all related topics – I realized that the field had shifted since I was last actively involved.

The shift – with the emergence of deep learning – was very much towards energy-based neural networks. The Boltzmann machine (restricted or not), its forebear (the Little/Hopfield neural network), and then all things “deep.” Multiple layers, diverse architectures.

These all had one thing in common: a statistical mechanics foundation.

I Was Lucky

Call it kismet. Call it karma. Call it luck.

Whatever it was, I drew the “lucky straw,” because my stat-phys background (rusty though it was) let me get back up to speed. Fairly quickly, in fact.

Mostly, I had lingering memories of “something that I needed to learn all over again.” Those memories helped me focus my search and re-education.

I realized that most people entering the field were not so lucky.

I came across this Quora post, which I’m copying from my personal website. (My post, on my book-in-progress, gives the link: https://www.aliannajmaren.com/book/ )

“How can I develop a deep/unified view of statistical mechanics, information theory and machine learning?”

“I learned all three fields independently (from physics, EE and cs), and many of the same concepts show up in all of them, some times with different meaning and views, like: entropy, maximum entropy model, Ising model, Boltzmann machine, partition function, information, energy, mean field, variational methods, phase transition, relative entropy, coding/decoding/inference, etc, etc. I felt that my understanding of those concepts are broken, lacking an unified view.”

Post on Quora, https://www.quora.com/How-can-I-develop-a-deep-unified-view-of-statistical-mechanics-information-theory-and-machine-learning-I-learned-all-three-fields-independently-and-many-of-the-same-concepts-show-up-in-all-of-them-I-feel-my-understanding-lacks-a-unified-view

This was Quora-posted in 2017. Now, five years later, it’s still an extremely relevant question.

Many of us coming into the AI/ML (machine learning) field may have an understanding of one or two of the lead-in disciplines, but rarely all of them.

Also, what is particularly challenging is making the connection between the foundational disciplines and their roles in AI/ML.

Often, the important equations stand as a metaphor – a sort of model-that-leads-to-a-model.

In the case of that Quora-question, there are now 37 answers.

I didn’t post an answer, because I felt that it deserved a more comprehensive approach.

Instead of posting a short answer, I started writing a book.

More recently, I’ve switched to more bite-sized and accessible pieces, such as blogs and YouTubes.

The challenge remains, though.

We need to understand the fundamentals, and then we need to pull them together into the AI/ML models that we use today.

Introducing “Black Diamonds”

All of this brings me to today’s topic: “black diamond” ratings for online materials.

Let’s start with the basic notion of our journey into AI/ML as the “Oregon Trail of AI.” I’ve used this analogy multiple times; for good measure, here it is again.

In this analogy, we all (collectively) start our AI journey in a metaphorical “Elm Grove, Missouri.” We typically get as far as a metaphorical “Fort Laramie, Wyoming.”

CAPTION: Figure 2. The “Oregon Trail” of Artificial Intelligence, with TWO Donner Passes. The first (traditional) one is the convergence of disciplines needed to understand energy-based neural networks (e.g., the Boltzmann machine) and all manner of “deep” neural architectures. The second (new) one is further towards the California coast, and is the convergence of the same disciplines, but this time yielding variational methods – which are the ever-evolving “Gold Coast” of California. This figure appeared earlier in: https://themesis.com/2022/06/28/the-kullback-leibler-divergence-free-energy-and-all-things-variational-part-1-of-3/

Getting to this metaphorical “Fort Laramie” means:

  • You understand backpropagation – you’ve gone through the equations, you know how it works, and you’ve got a notion of how it is a “discriminative” neural network.
  • You understand that there is something called generative neural networks – that these include the (simple or restricted) Boltzmann machine, and that these are used in deep learning architectures. But you probably don’t know the equations or fundamentals, other than that they learn to identify the latent variables without needing a pre-labeled training set.
  • You’ve done some sort of neural network construction, using any combination of tools, and you’ve carried out one or more projects – and typically have used a combination of convolutional neural networks (CNN) together with a long short-term memory (LSTM) network.

And this, typically, is where many of us stall out.

This position is at the outskirts of the “Sierra Nevada mountains of AI/ML.” It’s the metaphorical “Fort Laramie.” This is where settlers on the Oregon Trail would stop to replenish, regroup, and just take a moment before heading out on the next – much more difficult – part of their journey.

Going further requires going into the mountains – going up in altitude, so to speak – and working with statistical physics (statistical mechanics).

Many people in our field start out on that journey.

Unless, however, they are guided by extremely skilled trail guides (e.g., at Stanford, MIT, and the like), they get into these mountains and die in the Donner Pass of AI.

It’s Easy to Get Lost

Several years ago, I had a student who came to me, frustrated and confused. He’d started going into the energy-based materials on the web – there were some YouTube lectures, etc.

And he couldn’t understand that material.

Now this was a VERY BRIGHT guy.

He was hard-working, determined, and very ambitious.

Thus, it was HUGELY DISTRESSING to him that he couldn’t make headway.

More to the point, it was coming down to a “what is wrong with me?” type of question.

This is true for many of us over-achievers.

We think – with the plethora of materials available – that we should be able to make headway on our own.

Identify terms, and start tracking them down. (Wiki summaries, by the way, are just NOT a good place for self-education; they’re written by physicists FOR physicists.)

Find YouTubes. Watch them.

But by now, there are so many YouTubes – on ALL of these topics – that we get lost in a metaphorical wilderness just as we start to ascend the slopes.

The Real Problem

The real problem, then, is not in finding materials.

It’s figuring out which materials are at the level that we need.

Continuing with our “Oregon Trail” metaphor – using Google search and YouTube search (with its similar algorithm), we can rapidly find sources.

But this is like parachuting into the mountains.

We know that these are mountains, but until we’re actually in them, we really don’t understand what that means.

Meaning, we can pick up a YouTube, blogpost, or even something labeled “tutorial.” However, it might be a “tutorial” designed for someone who already has a Ph.D. in physics.

That leaves many of us giving up in frustration, confusion, and despair.

What’s worse is that we start to blame ourselves.

We Need Bunny Slopes

What we DO NOT need is to (metaphorically) parachute into the highest elevations, and promptly die due to cold, exhaustion, and altitude sickness. Or wander around, getting more and more lost. (And discouraged. And then blaming ourselves.)

What we DO need are “bunny slopes.”

We need tutorials that start off at the simplest possible level, and just get us through the mountain pass.

We need to NOT go high up on a slope – unless we really, absolutely HAVE TO – and most of the time, we don’t.

We need clear, simple, well-marked “bunny trails.”

CAPTION: Figure 3. We need to start our statistical mechanics journey at the metaphorical “bunny slopes.”

A Good Bunny Starting Place

Let’s go back to that Quora post from 2017. I’m going to list the terms that the poster identified:

  • Entropy,
  • Maximum entropy model,
  • Ising model,
  • Boltzmann machine,
  • Partition function,
  • Information,
  • Energy,
  • Mean field,
  • Variational methods,
  • Phase transition,
  • Relative entropy,
  • Coding/decoding/inference,
  • Etc, etc.

You and I can add more terms but … let’s not.

Let’s just take this as an illustrative starter list.

What this tells us, right away, is that the first thing that we need is a handle on the terminology.

I’m not about to get into providing definitions for all of these today – BUT – we can do something, very fast.

We can organize these terms by discipline.

Statistical Mechanics:

  • Entropy,
  • Ising model,
  • Partition function,
  • Energy,
  • Mean field,
  • Phase transition

This isn’t a complete list of all the stat-mech terms that we’d like, but it’s a decent start.

Information Theory / Bayesian Probability:

  • Entropy (Note that I’m including it again here, even though it is also in the stat-mech list),
  • Maximum entropy model,
  • Information,
  • Relative entropy,
  • Coding/decoding/inference (“Inference” really belongs elsewhere; the original poster lumped it in with coding/decoding.)

Again, not a complete list – and not well-sorted by sub-domain, but still a decent start.

Neural Networks and Machine Learning Methods

  • Boltzmann machine,
  • Variational methods

Now that we’ve made a preliminary start on term-organization, we see that “Boltzmann machine(s)” and “variational methods” are stand-ins, each for a broad area of both investigation and modeling methods.

So – we’ve achieved a first-pass rough-chunking.

This tells us that if we want to get to our models, we FIRST need statistical mechanics, and THEN need all the info-theory stuff. There are two distinct sets of “fundamentals” here.

And yes, someone could start off with the Bayesian, etc., and try to factor in the stat-mech as a second step – but I think that would get a person cross-ways, because very early in Bayesian/info-theory, we start talking about entropy, and there is just no point in talking about entropy until we’ve got some stat-mech under our belts.
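
(Just to make that entropy overlap concrete, here is a minimal sketch – my own toy illustration, with made-up numbers, not anything from the Quora thread. The Gibbs entropy of statistical mechanics and the Shannon entropy of information theory are the very same sum over a probability distribution; only the constant out front, and the units, differ.)

import numpy as np

# My own toy illustration: the two "entropies" are the same sum over probabilities.
p = np.array([0.5, 0.25, 0.25])           # any discrete probability distribution

k_B = 1.380649e-23                        # Boltzmann's constant, in J/K
S_gibbs   = -k_B * np.sum(p * np.log(p))  # stat-mech (Gibbs) entropy, in J/K
H_shannon = -np.sum(p * np.log2(p))       # info-theory (Shannon) entropy, in bits

print(S_gibbs, H_shannon)                 # same structure; different constant and units

So the formula itself is the easy part. What the stat-mech gives us first is the physical story behind it, which is why I’d rather we climb in that order.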

Once we’ve got both sets of fundamentals, stat-mech and the Bayesian/info-theory, then we can start in on the Boltzmann machines and the variational inference.

In fact, if we add in Kullback-Leibler (needed for the variational methods), we’ve just created the Donner Pass map that we’ve been using all along.

Just for reference, here it is again.

The following Figure 4 illustrates the two key “Donner Pass” areas; you can see that they are convergences between statistical mechanics (the free energy notion) and Bayesian probabilities.

CAPTION: Figure 4. There are two “Donner Pass” situations when self-studying your way through advanced AI. The first (“Donner Pass #1”) is the convergence between free energy (from statistical mechanics), Bayesian probability, and neural networks (not shown in the figure), resulting in Boltzmann machines. The second (“Donner Pass #2”) is the convergence between free energy, Bayesian probabilities, the Kullback-Leibler divergence, and model-optimization (not shown in the figure), to yield variational methods, such as variational Bayes. This figure appeared earlier in: https://themesis.com/2022/06/28/the-kullback-leibler-divergence-free-energy-and-all-things-variational-part-1-of-3/
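
Just to give a flavor of what is waiting at that second pass, here is a minimal sketch of my own – a toy three-state latent variable with made-up numbers, not anything taken from the figure’s source. The whole variational story rests on one identity: the log evidence splits into an “evidence lower bound” (a negative variational free energy) plus a Kullback-Leibler divergence. Shrink the KL term, and you have raised the bound.

import numpy as np

# Toy discrete model: one latent variable z with 3 states, one observed x.
# Hypothetical numbers, just to illustrate log p(x) = ELBO + KL(q || p(z|x)).

p_z = np.array([0.5, 0.3, 0.2])            # prior p(z)
p_x_given_z = np.array([0.9, 0.4, 0.1])    # likelihood p(x | z) for the observed x

p_xz = p_z * p_x_given_z                   # joint p(x, z)
p_x = p_xz.sum()                           # evidence p(x)
p_z_given_x = p_xz / p_x                   # exact posterior p(z | x)

q = np.array([0.6, 0.3, 0.1])              # some approximate posterior q(z)

elbo = np.sum(q * (np.log(p_xz) - np.log(q)))          # negative variational free energy
kl = np.sum(q * (np.log(q) - np.log(p_z_given_x)))     # KL(q || p(z|x))

print(np.log(p_x), elbo + kl)              # these two numbers match

We will work through that identity properly later in this series; for now, just notice that free energy, Bayesian probabilities, and the Kullback-Leibler divergence all show up in those few lines – exactly the convergence that Figure 4 is sketching.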

Let’s do an imaginary overlay of this “conceptual map” against the “metaphorical map” of the Oregon Trail of AI, shown previously in Figure 2.

We’ve Mapped Our Bunny Trail

Well, we’ve made a start on mapping our “bunny trail.”

What we have is a list of terms for each of the “mountain ranges” that come together in the Donner Pass(es) of AI. (We now have two metaphorical Donner Passes: one for the Boltzmann machine/deep learning, and another for variational methods/inference.)

For the first time, we’ve made a preliminary sketch of what we’d like to learn to traverse each of these “mountain ranges.”

This is a lot like naming the specific mountain peaks.

It’s giving ourselves a set of reference points or landmarks.

Also, as we start mastering each of these terms, one-by-one, we get a sense of the specific progress that we’re making.


We Can Work with This

What we now have is a reasonable starting place.

What we WANT NEXT is a set of term-definitions.

Not too much depth; sort of phrase-book level.

Also, we want to know how the terms connect with each other.

Here’s an example:

CAPTION: Figure 5. Three key concepts from statistical mechanics. The notion of “microstates” leads to the “partition function,” which leads to “free energy.” (And also to “entropy.”) Figure from Top Ten Terms in Statistical Mechanics, by A.J. Maren for Themesis, Inc. (Preprint.)

Figure 5 shows how three terms (or concepts) link to each other. Once you understand something called “microstates,” you can move on to the “partition function.” Once you understand the “partition function,” you can get to “free energy.” (Also to “entropy,” not shown in this figure.)
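
If you like to see the numbers move, here is a minimal sketch of that exact chain – my own toy two-state system, with made-up energies and units where Boltzmann’s constant is 1, not anything from the preprint. The microstates give us the partition function, the partition function gives us the free energy, and the same Boltzmann probabilities give us the entropy.

import numpy as np

# Toy two-state system: microstates -> partition function Z -> free energy F (and entropy S).
k_B = 1.0                         # work in units where Boltzmann's constant = 1
T = 1.0                           # temperature
beta = 1.0 / (k_B * T)

E = np.array([0.0, 1.0])          # energies of the two microstates (arbitrary units)

Z = np.sum(np.exp(-beta * E))     # partition function: sum over microstates
F = -k_B * T * np.log(Z)          # free energy: F = -kT ln Z

p = np.exp(-beta * E) / Z         # Boltzmann probability of each microstate
S = -k_B * np.sum(p * np.log(p))  # Gibbs entropy from those probabilities

U = np.sum(p * E)                 # average energy, just to check F = U - T*S
print(F, U - T * S)               # the two values agree

That is the whole “bunny trail” through this particular stretch: three landmarks, one short walk between them.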

This is a lot like saying, “If I can recognize one mountain, then I know what that next mountain peak is, and then the next one.”

In short, you’ve started building up a map of your terrain.

You do not have to go up each and every mountain.

In fact, it’s better if you don’t.

You DO NOT need a lot of fancy, intensive derivations.

The “bunny slope” notion applies.

So does the idea of traversing the mountain range by going through the low-level passes, not going up and down each mountain.

After all, our goal is to get through the mountain ranges, GET THROUGH DONNER PASS, and get to California – where the Gold Coast (all those jobs) awaits us.

We Need a Phrase Book

That’s the next logical step, right?

How about if we start with a phrase book for the statistical mechanics terms?

Let’s call it The Top Ten Terms (that you need to know) in Statistical Mechanics.

Can You Give Us Your Input, Please?

Actually, we’re working on that phrase book.

Almost there.

Before we publish, though (and it will likely be in the form of a short course, with a series of associated vids), we’d love to have some input from you.

What do you feel that you really, REALLY need – that would help you get through these metaphorical mountains?

Love to get your input in the Comments!

And thank you!


Have fun, darlings! – AJM

References

NOTE: The references, blogposts, and YouTubes included here will be useful across all blogposts in this Kullback-Leibler / free energy / variational inference series. These references are largely replicated across all the blogposts in the series. (A little more added each week.)

Beal, M. 2003. Variational Algorithms for Approximate Bayesian Inference, Ph.D. Thesis, Gatsby Computational Neuroscience Unit, University College London. pdf.

Blei, D.M. Variational Inference: Foundations and Applications. (Presented May 1, 2017, at the Simons Institute.) http://www.cs.columbia.edu/~blei/talks/Blei_VI_tutorial.pdf

Blei, D.M., A. Kucukelbir, and J.D. McAuliffe. 2016. “Variational Inference: A Review for Statisticians.” arXiv:1601.00670v9. doi:10.48550/arXiv.1601.00670. (Accessed June 28, 2022; pdf.)

Blei, D.M., A.Y. Ng, and M.I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993-1022. (Accessed June 28, 2022; https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf )

Friston, K., M. Levin, B. Sengupta, and G. Pezzulo. 2015. “Knowing one’s place: a free-energy approach to pattern regulation.” J. R. Soc. Interface 12: 20141383. doi:10.1098/rsif.2014.1383. pdf.

Friston, K. 2013. “Life as we know it.” Journal of The Royal Society Interface 10. doi:10.1098/rsif.2013.0475. pdf.

Friston, K. 2010. “The free-energy principle: a unified brain theory?” Nature Reviews Neuroscience 11(2): 127-138. online access.

Kullback S, and R.A. Leibler. 1951. “On Information and Sufficiency.” Ann. Math. Statist. 22(1):79-86. (Accessed June 28, 2022; https://www.researchgate.net/publication/2820405_On_Information_and_Sufficiency/link/00b7d5391f7bb63d30000000/download )

Maren, Alianna J. 2019. “Derivation of the Variational Bayes Equations.” arXiv:1906.08804v4 [cs.NE] (Themesis Technical Report TR-2019-01v4 (ajm).) doi:10.48550/arXiv.1906.08804. (Accessed 2022 Sept. 6; https://arxiv.org/abs/1906.08804 )


Related Blogposts

Maren, Alianna J. 2022. “The Kullback-Leibler Divergence, Free Energy, and All Things Variational (Part 2 of 3).” Themesis, Inc. Blogpost Series (www.themesis.com). (Sept. 15, 2022) (Accessed Sept. 19, 2022; https://themesis.com/2022/09/15/the-kullback-leibler-divergence-free-energy-and-all-things-variational-part-2-of-3/ )

Maren, Alianna J. 2022. “The Kullback-Leibler Divergence, Free Energy, and All Things Variational (Part 1.5 of 3).” Themesis, Inc. Blogpost Series (www.themesis.com). (Sept. 8, 2022) (Accessed Sept. 15, 2022; https://themesis.com/2022/09/08/the-kullback-leibler-divergence-free-energy-and-all-things-variational-part-1-5-of-3/ )

Maren, Alianna J. 2022. “The Kullback-Leibler Divergence, Free Energy, and All Things Variational (Part 1 of 3).” Themesis, Inc. Blogpost Series (www.themesis.com). (June 28, 2022) (Accessed Sept. 6, 2022; https://themesis.com/2022/06/28/the-kullback-leibler-divergence-free-energy-and-all-things-variational-part-1-of-3/ )

Maren, Alianna J. 2022. “Major Blooper – Coffee Reward to First Three Finders.” Themesis, Inc. Blogpost Series (www.themesis.com). (June 2, 2022) (Accessed Sept. 6, 2022; https://themesis.com/2022/06/02/major-blooper-coffee-reward/ )

Maren, Alianna J. 2022. “How Backpropagation and (Restricted) Boltzmann Machine Learning Combine in Deep Architectures.” Themesis, Inc. Blogpost Series (www.themesis.com). (January 5, 2022) (Accessed June 28, 2022; https://themesis.com/2022/01/05/how-backpropagation-and-restricted-boltzmann-machine-learning-combine-in-deep-architectures/ )

Maren, Alianna J. 2022. “Entropy in Energy-Based Neural Networks.” Themesis, Inc. Blogpost Series (www.themesis.com). (April 4, 2022) (Accessed Aug. 30, 2022; https://themesis.com/2022/04/04/entropy-in-energy-based-neural-networks-seven-key-papers-part-3-of-3/ )

Maren, Alianna J. 2014. “The Single Most Important Equation for Brain-Computer Information Interfaces.” Alianna J. Maren Blogpost Series (www.aliannajmaren.com). (November 28, 2014) (Accessed Aug. 30, 2022; https://www.aliannajmaren.com/2014/11/28/the-single-most-important-equation-for-brain-computer-information-interfaces/ )


Related YouTubes

Maren, Alianna J. 2021. “Statistical Physics Underlying Energy-Based Neural Networks.” Themesis, Inc. YouTube channel. (August 26, 2021) (Accessed Sept. 15, 2022; https://www.youtube.com/watch?v=ZazyMS-IDg8&t=324s )

Maren, Alianna J. 2021. “The AI Salon: Statistical Mechanics as a Metaphor.” Themesis, Inc. YouTube channel. (Sept. 2, 2021) (Accessed Aug. 30, 2022; https://www.youtube.com/watch?v=D0soPGtBbRg )

Maren, Alianna J. 2021. “Statistical Mechanics of Neural Networks: The Donner Pass of AI.” Themesis, Inc. YouTube channel. (Sept. 15, 2021) (Accessed Aug. 30, 2022; https://www.youtube.com/watch?v=DjKiU3qRr1I )


Books, Syllabi, and Other Resources

Feynman, R.P. 1972, 1998. Statistical Mechanics: A Set of Lectures. Reading, MA: Addison-Wesley; Amazon book listing.

Sethna, James. 2006. Statistical Mechanics: Entropy, Order Parameters, and Complexity. Oxford, England: Oxford University Press. (Accessed Sept. 7, 2022; https://sethna.lassp.cornell.edu/StatMech/EntropyOrderParametersComplexity20.pdf )

I just found (for last week’s post) this fabulous little syllabus put together by Jared Tumiel; it’s spot-on, very well-organized, and has enough in it to keep most of us busy for the next several years. It’s hosted on his GitHub site.

Tumiel, Jared. 2020. “Spinning Up in Active Inference and the Free Energy Principle: A Syllabus for the Curious.” Jared Tumiel’s GitHub Repository. (Oct. 14, 2020.) https://jaredtumiel.github.io/blog/2020/10/14/spinning-up-in-ai.html

I just found this lovely little article by Damian Ejlli. It is a perfect read – if you ALREADY know statistical mechanics … and quantum physics … and Bayesian methods. (So, that rules out … a HUGE number of potential readers.) Other than that, perfectly useful.

Ejlli, Damian. 2021. “Three Statistical Physics Concepts and Methods Used in Machine Learning.” Towards Data Science (Oct. 18, 2021). (Accessed Sept. 11, 2022; https://towardsdatascience.com/three-statistical-physics-concepts-and-methods-used-in-machine-learning-f9cc9f732c4 )
