The Kullback-Leibler Divergence, Free Energy, and All Things Variational (Part 1 of 3)

Variational Methods: Where They Are in the AI/ML World

The bleeding-leading edge of AI and machine learning (ML) deals with variational methods. Variational inference, in particular, is needed because we can’t envision every possible instance that would comprise a good training and testing data set. There will ALWAYS be some sort of oddball thing that is a new experience for our AI/ML system. 


By Way of Analogy: The Artificial Intelligence “Oregon Trail” 

By way of analogy, variational methods define the ever-evolving “California Gold Coast” of artificial intelligence. They are the end-point of our journey through the “Artificial Intelligence Oregon Trail.”

I’ve used this analogy multiple times, in YouTube vids – an example is the one inserted below: Statistical Mechanics of Neural Networks: The Donner Pass of Artificial Intelligence.

CAPTION: Maren, A.J. 2021. “Statistical Mechanics of Neural Networks: The Donner Pass of Artificial Intelligence.” Themesis, Inc. YouTube channel.

In this analogy, we all (collectively) start our AI journey in a metaphorical “Elm Grove, Missouri.” We typically get as far as a metaphorical “Fort Laramie, Wyoming.”

CAPTION: The “Oregon Trail” of Artificial Intelligence, with TWO Donner Passes. The first (traditional) one is the convergence of disciplines needed to understand energy-based neural networks (e.g., the Boltzmann machine) and all manner of “deep” neural architectures. The second (new) one is further towards the California coast, and is the convergence of the same disciplines, but this time yielding variational methods – which are the ever-evolving “Gold Coast” of California.

To get to Fort Laramie, we go through prairie grasslands, learning gradient descent methods – e.g., backpropagation. We learn how to USE more advanced neural networks – e.g., using TensorFlow and Keras – and we can create layered “deep” systems, and combine convolutional neural networks (CNNs) and long short-term memory (LSTM) networks into reasonably complex systems.

We can get some very impressive results at this point, and are typically very good with the mechanics of making our systems work. We understand that there ARE such things as “energy-based neural networks,” a.k.a. Boltzmann machines – but generally don’t have a solid grasp on how they work, as differentiated from multilayer Perceptron (MLP)-based neural networks.

This is as far as many of us get, theoretically.

This means – there are a whole lot of us camped out in Fort Laramie, Wyoming.

Getting from Elm Grove to Fort Laramie is a very solid achievement. In the MSDS 458 AI & Deep Learning class in Northwestern University’s Master of Science in Data Science program (where I currently teach), it takes students a full ten-week quarter to get this far. Historically, this compares with the approximate six weeks that it took the pioneers to travel through the prairie grasslands.

The second major portion of the journey is getting through the “Sierra Nevada mountains” of the “AI Oregon Trail.” In pioneer terms, the total journey took four-to-five months, or about 20 weeks. That is, the first (easy) portion was about one-third of the total; the remaining two-thirds were through the mountains.

In this metaphorical journey, the “Donner Pass” situations come up when students need to navigate the convergence of multiple disciplines. (For those of you unfamiliar with the Donner Pass story, check out the Encyclopaedia Britannica version: https://www.britannica.com/topic/Donner-party )

The first of these – what I’ve most often referred to as the “Donner Pass” of AI – involves combining statistical thermodynamics (especially the free energy notion), neural networks, and Bayesian methods to create an energy-based neural network, a.k.a. a Boltzmann machine.
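To make that convergence concrete (a minimal sketch, using one common notation – papers vary in how they write the biases and weights): a restricted Boltzmann machine assigns an energy to every joint configuration of visible units v and hidden units h, and statistical mechanics enters through the Boltzmann distribution that turns that energy into a probability:

E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j

p(v, h) = e^{-E(v, h)} / Z, \quad \text{where} \quad Z = \sum_{v, h} e^{-E(v, h)}

The free energy, the entropy, and the partition function Z that the classic papers lean on all flow from this one exponential form.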

Once we get one Boltzmann machine, we can layer several of them and create various forms of “deep” architectures, using backpropagation to smooth out the connections. (See: https://themesis.com/2022/01/05/how-backpropagation-and-restricted-boltzmann-machine-learning-combine-in-deep-architectures/ )

There is a second Donner Pass – similar to the first one, but distinctly different.

This second Donner Pass lies much closer to the “California Gold Coast,” which is the realm of variational methods.

This second Donner Pass again involves statistical thermodynamics (free energy) and Bayesian logic. Unless you’re doing variational autoencoders, you don’t have a neural network; instead, you have a model. For many variational systems – and particularly for tutorials on this subject – these models are members of the exponential family, and are often Gaussian models.
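One reason that Gaussians show up so often: the Kullback-Leibler divergence between two Gaussians has a simple closed form, so the variational machinery stays tractable. Here is a minimal sketch (my own illustration, not code from any of the referenced papers) that computes the closed-form K-L divergence between two univariate Gaussians and checks it with a Monte Carlo estimate:

import numpy as np

def kl_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    # Closed-form KL( q || p ) for q = N(mu_q, sigma_q^2) and p = N(mu_p, sigma_p^2)
    return (np.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2.0 * sigma_p**2)
            - 0.5)

def log_gaussian_pdf(z, mu, sigma):
    # Log-density of N(mu, sigma^2) evaluated at z
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - (z - mu)**2 / (2.0 * sigma**2)

# Monte Carlo check: KL( q || p ) = E_q[ log q(z) - log p(z) ]
rng = np.random.default_rng(seed=0)
mu_q, sigma_q, mu_p, sigma_p = 0.0, 1.0, 1.0, 2.0
z = rng.normal(mu_q, sigma_q, size=1_000_000)        # samples drawn from q
kl_mc = np.mean(log_gaussian_pdf(z, mu_q, sigma_q)
                - log_gaussian_pdf(z, mu_p, sigma_p))

print(kl_gaussian(mu_q, sigma_q, mu_p, sigma_p))      # closed form: about 0.443
print(kl_mc)                                          # Monte Carlo: very close to the same value

For these particular values, both the closed form and the Monte Carlo estimate land near 0.44 – which is the kind of sanity check that keeps you oriented once the notation gets heavier.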

The reason that so many people get lost – in one “Donner Pass” or another – is the need to combine two (or more) fields of study. If you’re lacking background in one of them – especially in statistical mechanics – it is too easy to get disoriented and lost.

Important side note: All too often, various teachers – including top-of-the-line experts and innovative thinkers – downplay the role of stat-mech when they present either Boltzmann machines or variational methods to their students. This happens because they’re trying to make the material “accessible” to an audience that has no idea of what stat-mech is; none whatsoever.

Short-term, this works. But as soon as you start reading the classic literature, this approach shows its weakness – because the classic literature, i.e., all the main papers on Boltzmann machines and their derivatives, assumes that you know statistical mechanics SO WELL that the authors don’t even have to mention it in context. This leads to ALL SORTS OF PROBLEMS for people who are just entering the field.

Back to our main theme: Donner Passes and what happens when you’re in one.

The first thing that happens is that you get stuck. That means – you begin to realize that there are several scientific disciplines that are coming together, and you try to get on top of each of them – doing Google searches on keywords, etc. But the truth is – it is DAMN HARD to sidebar into statistical mechanics when you don’t have a good place in which to begin – and it’s hard to begin when you’re already deep into the material that you’re studying – i.e., trying to read a classic paper.

The following figure illustrates the two key “Donner Pass” areas; you can see that they are convergences between statistical mechanics (the free energy notion) and Bayesian probabilities.

CAPTION: There are two “Donner Pass” situations when self-studying your way through advanced AI. The first (“Donner Pass #1”) is the convergence between free energy (from statistical mechanics), Bayesian probability, and neural networks (not shown in the figure), resulting in Boltzmann machines. The second (“Donner Pass #2”) is the convergence between free energy, Bayesian probabilities, the Kullback-Leibler divergence, and model optimization (not shown in the figure), to yield variational methods, such as variational Bayes.

So, stuckness happens. Google searches just make you more confused and cranky, and ultimately you give up – because there is either too much material, or not enough, or not enough at the right entry level.

The second thing that happens is “death in the white-out blizzard.” This means – there are a lot of little things coming at you, and you don’t know WHY they’re important. An example is the mention of “Gibbs sampling” and “… approach[ing] the prior distribution” and “Markov chains” – all within a few paragraphs of a VERY classic Salakhutdinov and Hinton paper (2012; see References at the end). Unless you know that they’re addressing the entropy portion of minimizing free energy for a system, you have no idea of – not so much WHAT they’re talking about, but WHY they’re making such an important point of these matters.
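For context (a compact reminder of the standard stat-mech definitions, not anything specific to that paper): the free energy of a system is

F = U - T S

where U is the average energy over all possible configurations, T is the temperature, and S is the entropy. Those averages run over an astronomically large number of configurations, which is exactly why Markov chains and Gibbs sampling show up: they draw samples from the underlying Boltzmann distribution so that the needed averages can be approximated rather than computed exactly.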

This is because the authors are living, swimming, breathing in a world defined by statistical mechanics and free energy minimization. This is so much an “assumed thing” that they don’t even mention it – they trust that whoever is reading their paper has that same reference frame.

That worked very well in 2012. Back then, and within the context of that journal’s readership, that was a safe assumption.

For those entering the field, right now, not so much. So – terms show up that seem disembodied from context, much like biting little snow crystals in a blizzard. Death to those caught unprepared.

A second kind of “blizzard white-out death” happens when we are a bit more advanced, and are working through two or more papers – and get caught in the mental static that comes from cross-comparing notation. The notation can even be from the same author(s), but if it comes from two papers published sufficiently far apart in time, there can be small differences. And there we are, caught in the “does this mean the same thing as that?” question, and paralysis results. And then, eventual death – in terms of making progress in understanding these classic papers.

And since the way forward – the new advances – depends on understanding prior works, we’re then totally stalled. We’re trapped in some sort of Donner Pass, unable to move forward or backward, and our minds are caught in a permanent blizzard white-out of terms and notation.

The goal of these Themesis posts is to help ease you through your Donner Pass.


Kickstarting Your Variational Study: Where to Begin

I was looking for some good articles that gave context for variational methods – things that were light on the math, easy on the eyes, and primarily motivational.

Nope. No such luck. (I may find some things over the next few weeks, and if so, will insert into the references in future blogposts – and probably come back to this one and do an update.) 

The easiest thing that I’ve found is David Blei’s 221-slide presentation on variational inference, often dubbed “VI” (just as we use “AI” for artificial intelligence, “ML” for machine learning, “DL” for deep learning … you get the drift; and it’s important enough in the AI/ML community to have its own little acronym). So here’s a link to Blei’s presentation – the full reference is below, and the link is HERE: http://www.cs.columbia.edu/~blei/talks/Blei_VI_tutorial.pdf  

This is Blei’s “Variational Inference: Foundations and Innovations” presentation – and I am so NOT suggesting that you read the whole thing (unless you really want to). Instead, use it as a kick-starter. There are LOTS of pretty pictures in the beginning, and there are a LOT of graphs and figures throughout – it is visually rich. So, if you’re just wanting a little something before going to bed – you’re not up for the hard-core equations, but you want to get your mind in the right mood – this is a good start. 

Blei points out that there are multiple disciplines leading into a study of variational methods. (I think he’s de-emphasizing the stat-mech part.) Here’s his slide from that presentation:

CAPTION: David Blei suggests that a little probability, a little optimization, and a little Bayesian are all that you need to get started in variational inference (VI) methods – I personally think that a bit more than this is needed, but he is indeed pointing out that this is a multi-disciplinary field. Reference and link at the end of this blogpost.

Blei is one of the key developers of variational methods. Also, Blei (with Andrew Ng and Michael Jordan, all three of them heavy-hitters) developed the Latent Dirichlet Allocation (LDA) method in 2003, which kick-started our modern era of natural language processing (NLP) (Blei et al., 2003; see the References at the end). Variational methods and the LDA are very intertwined in their foundational thinking, so you can see where a lot of the NLP advances come from.  


Why We Need BOTH the K-L Divergence and Free Energy

The basic notion of variational inference is that we can take a divergence measure (most typically, the Kullback-Leibler divergence), play some fancy mathematical games with it, and make it look like a free energy equation.

The free energy that we have in variational inference is NOT, so help me God, a true “free energy.” Free energy comes out of statistical mechanics (statistical thermodynamics), and has a macro-correspondence in classical (macro-level) thermodynamics. 

So what happens in the world of variational inference / machine learning is ABSOLUTELY NOT real (statistical) thermodynamics; the equation that is produced is NOT a real free energy equation. 

The thing is, this “manipulated” divergence equation looks – formally – LIKE a free energy equation – and this is where the AI/ML/DL/VI theoreticians and developers have had a field day. 
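Here is that resemblance in its most compressed form – just a preview, using one common notation (the full walkthrough comes later in this series). Write q(z) for the approximating distribution and p(z | x) for the true posterior. Then the K-L divergence splits as

D_{KL}( q(z) \| p(z \mid x) ) = F(q) + \log p(x)

F(q) = \mathbb{E}_q[ -\log p(x, z) ] - H[q]

where H[q] is the entropy of q. The quantity F(q) – an expected “energy” (the -\log p(x, z) term) minus an entropy – is what the variational literature calls the variational free energy, and it has exactly the formal shape of the thermodynamic F = U - TS. Because \log p(x) does not depend on q, minimizing F(q) is the same as minimizing the K-L divergence, and F(q) can never drop below -\log p(x); the negative of F(q) is the “evidence lower bound,” or ELBO.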

It is this metaphorical similarity that makes it important to understand enough stat-mech so that you have context for what is going on in the world of variational-whatevers. Otherwise, it’s like trying to understand an Edwardian-era ode, with many classical references, without knowing classical Greek and Roman mythology. Context is everything.

CAPTION: Maren, A.J. 2021. “The AI Salon: Statistical Mechanics as a Metaphor.” Themesis, Inc. YouTube channel.

We’ll go through this mathematical manipulation in full detail a bit down the road.

The math, though – and I mean the basic, hard-core math, the derivations and such – is not the hard thing. Almost all of us can get through a derivation. (Sometimes, we need to dogleg back and review a bit of previously-learned and now-forgotten math, but we can do that.)

What bites us in the soft-and-tenders, every single time, is correlating between different frames of reference. Meaning – correlating different notations. 

This is EXACTLY what led to my “big blooper,” mentioned in the previous post. (See: https://themesis.com/2022/06/02/major-blooper-coffee-reward/ )

It was a notational mistake, and I made it after working through several papers, and carefully producing what I’d called a “Rosetta stone” paper – one expressly designed to match up the notation between three different sources. 

So this whole blogpost series falls in the category of “If you can’t be a good example, then at least be a horrible lesson.” 

The “horrible lesson” is forthcoming. 


To your health, well-being, and outstanding success!

Alianna J. Maren, Ph.D.

Founder and Chief Scientist, Themesis, Inc.


References

NOTE: The references, blogposts, and YouTubes included here will be useful across all three blogposts in this series.

Beal, M. 2003. Variational Algorithms for Approximate Bayesian Inference, Ph.D. Thesis, Gatsby Computational Neuroscience Unit, University College London. pdf.

Blei, D.M. 2017. Variational Inference: Foundations and Innovations. (Presented May 1, 2017, at the Simons Institute.) http://www.cs.columbia.edu/~blei/talks/Blei_VI_tutorial.pdf

Blei, D.M., A. Kucukelbir, and J.D. McAuliffe. 2016. “Variational Inference: A Review for Statisticians.” arXiv:1601.00670v9. doi:10.48550/arXiv.1601.00670 (Accessed June 28, 2022; pdf.)

Blei, D.M., A.Y. Ng, and M.I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993-1022. (Accessed June 28, 2022; https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf )

Kullback, S., and R.A. Leibler. 1951. “On Information and Sufficiency.” Ann. Math. Statist. 22(1): 79-86. (Accessed June 28, 2022; https://www.researchgate.net/publication/2820405_On_Information_and_Sufficiency/link/00b7d5391f7bb63d30000000/download )

Salakhutdinov, R., and G. Hinton. 2012. “An Efficient Learning Procedure for Deep Boltzmann Machines.” Neural Computation 24(8).

Related Blogposts

Maren, Alianna J. 2022. “How Backpropagation and (Restricted) Boltzmann Machine Learning Combine in Deep Architectures.” Themesis, Inc. Blogpost Series (www.themesis.com). (January 5, 2022) (Accessed June 28, 2022; https://themesis.com/2022/01/05/how-backpropagation-and-restricted-boltzmann-machine-learning-combine-in-deep-architectures/ )

Maren, Alianna J. 2022. “Entropy in Energy-Based Neural Networks.” Themesis, Inc. Blogpost Series (www.themesis.com). (April 4, 2022) (Accessed Aug. 30, 2022; https://themesis.com/2022/04/04/entropy-in-energy-based-neural-networks-seven-key-papers-part-3-of-3/ )

Maren, Alianna J. 2014. “The Single Most Important Equation for Brain-Computer Information Interfaces.” Alianna J. Maren Blogpost Series (www.aliannajmaren.com). (November 28, 2014) (Accessed Aug. 30, 2022; https://www.aliannajmaren.com/2014/11/28/the-single-most-important-equation-for-brain-computer-information-interfaces/ )


Related YouTubes

Maren, Alianna J. 2022. “The AI Salon: Statistical Mechanics as a Metaphor.” Themesis, Inc. YouTube channel. (Sept. 2, 2022) (Accessed Aug. 30, 2022; https://www.youtube.com/watch?v=D0soPGtBbRg)

Maren, Alianna J. 2022. “Statistical Mechanics of Neural Networks: The Donner Pass of AI.” Themesis, Inc. YouTube channel. (Sept. 15, 2022) (Accessed Aug. 30, 2022; https://www.youtube.com/watch?v=DjKiU3qRr1I)
