The Kullback-Leibler Divergence, Free Energy, and All Things Variational – Part 1.5 of 3

Let’s talk about the Kullback-Leibler divergence. (Sometimes, we call this the “K-L divergence.”) It’s the foundation, the building block, for variational methods.

The Kullback-Leibler divergence is a made-up measure. It’s not one of those “fundamental laws of the universe.” It’s strictly a made-up human thing.

Nevertheless, it’s become very useful – and is worth our attention.

Suppose that you start studying the K-L divergence on your own.

This generally goes pretty well, because it’s not that complicated. There are all sorts of web-based tutorials, videos, and blogposts on this. You can understand the K-L divergence with less than an hour’s worth of effort. (For a blogpost on this, using the same notation as used in Kullback and Leibler’s paper, check out my 2014 post, “The Single Most Important Equation for Brain-Computer Information Interfaces.” Link and full citation at the end of this post.)

The problem is not in learning the K-L divergence itself.

The problem is a lot sneakier and more subtle than learning the basic concept.

This sneaky, annoying, subtle thing is that different authors use different notation.

But First, a Little Digression

I’ve recently moved from the Big Island of Hawai’i, on the far right (east) of the Hawai’ian island chain, to Kaua’i, which is to the far left (west) of the island chain. This has meant all of the usual household-moving tasks, chores, and multiple levels of exhaustion. 

My car arrived five weeks after I did (yes, that was the soonest that I could get it shipped), and now I’m feeling my way around “my side” of the island, and learning the what-is-where on Kaua’i.

So – every time that we move into a new area, there’s this whole matter of learning the new “lay of the land.” 

My dears, there is a world of difference between studying a place with Google maps and actually driving the roads. 

As an illustration, here are three images of Lihue – the government/business/everything-center of Kaua’i.

Three Lihue images (cropped).
CAPTION: Three images of Lihue, the government/business/commerce town of Kaua’i. (a) Aerial image, taken from the north and looking south (note the ocean to the far left). Courtesy Dreamstime. (b) Google map of Lihue. (c) Google street view, of Pikake St. near the junction with Nuhou.

So far, I’ve spent several hours driving around Lihue (and other areas) – getting lost, getting found, getting lost again … (Just as a word – old car. No GPS. And I don’t use the phone while driving.)

Studying Google maps in advance gives SOME help, but the translation is not always easy.

Reading research papers that are on the same topic, but by different authors, is analogous to working with three different kinds of maps and images.

Trying to cross-correlate these maps and images puts us in a Donner Pass situation.

When we have lots of little terms and notation variations flying around in our heads, we have a “white-out blizzard death in Donner Pass.”

And Sometimes, We Need Just a Little Bit More …

Back to story-telling.

It was the first week that I had my car.

I was stressed; my car was stressed.

My car had spent over a month sitting at a friend’s home, waiting for its journey to the port and to its (second-ever) ocean voyage.

I wanted to take my car to Tire Warehouse, owned by a friend of a friend. The Tire Warehouse is located in downtown Lihue.

On my best of days, spatio-visual skills are my weakest “strength,” and right then, that strength was VERY weak.

I studied how to get there, but once in Lihue, I got turned around and lost.

I found the local public library. (Somehow, I can ALWAYS find the library!)

I went inside, in full “damsel-in-distress” mode.

The very kind security guard, William, tried to give me directions.

My brain just was not processing. “Can you draw me a map?” I pleaded.

He drew me a map.

CAPTION: Hand-drawn map of downtown Lihue, courtesy of William, security guard at the Lihue public library. NAPA Auto Parts – the first landmark – is in the center. Photo taken with my iPhone.

“Turn left at NAPA Auto Parts,” he said. “You can’t miss it.” (And more directions followed.)

I turned left, followed the rest of the directions, and got to Tire Warehouse safely.

A hand-drawn map is an entirely different kind of notation than an aerial image, a Google map, or a Google street view. Each of these is a representation of a physical system, and each of them conveys different information, and uses different notation.

Sometimes, we can’t limit ourselves to just one information source – as in, teaching ourselves from a textbook.

Sometimes, we need MULTIPLE sources, and these multiple sources present the same information. It’s just that the WAY in which each one presents it is a little bit different from the others.

Or one source makes a certain point, or has a certain emphasis – SOMETHING that puts it on our “must-read” list, even though other sources also do a competent job.

This is what it’s like when we are translating concepts and equations from one paper to another.


The Kullback-Leibler Divergence

Our first step in this whole variational inference journey is to look at the Kullback-Leibler divergence, which is the foundation for what happens next.

The Kullback-Leibler divergence is a measure of how much one probability distribution diverges from another – in our setting, a measure of how well a model matches a data set.

Invented by Kullback and Leibler in 1951, it has become – via some shrewd mathematical manipulations – a keystone element of machine learning methods, particularly all things variational. But … we’ll get to that part later.

For now, we want to just look at what it is. (Again, to see the equations, check out my 2014 post, “The Single Most Important Equation for Brain-Computer Information Interfaces.” Link and full citation at the end of this post.)

For completeness, here’s the basic notion – and the basic equation.
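Written out (a sketch, assuming – as in the variational literature – that the divergence is weighted by q), Eqn. 1 takes the form

    D_{KL}(q \| p) = \sum_{x} q(x) \, \ln \left[ \frac{q(x)}{p(x)} \right]

where the sum runs over all possible values of x.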

Eqn. 1. The basic, well-known form of the Kullback-Leibler divergence equation. Notice that in this form, p(x) represents the MODEL, and q(x) represents the DATA. This is the same notation as used by Friston et al., Beal, and Blei et al. (see discussion later in this post). It is a REVERSAL of the notation used by many researchers, and even of that used in the Wiki on the Kullback-Leibler divergence.

As we can see, it’s not that complicated.

Notation is always the most delicate and nuanced aspect of any study, and is often the thing that costs us the most time and trauma.

Kullback and Leibler, in their 1951 original paper, frame their divergence measure very abstractly. They talk about measuring the divergence between any two collections of data, and use the traditional f(x), g(x) notation in their paper. (Link to their original paper in the References section at the end of this post.)
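In that style (a sketch, using the f and g labels just mentioned, and glossing over the measure-theoretic care of the original paper), the divergence of one density f from another density g is something like

    I(f : g) = \int f(x) \, \log \left[ \frac{f(x)}{g(x)} \right] dx

which is the continuous counterpart of the sum in Eqn. 1.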

(Note: the notation in Eqn. 1 is based on works by Friston et al., see refs. at the end.)


The Indiana Jones Part of the Story

Have you watched any of those adventure movies in which the archeologists/treasure-hunters start with a scrap of an old map or an ancient book, and need to find an expert to translate the original material?

Sometimes, it’s like that when you start looking at the original material – in ANY discipline.

So (back to the movie), there they are … ancient manuscript in hand, spread out on a desk well-lit with candles. (Candles??? When the right approach is UV-fluorescent lighting? OK, so much for movie-mood-making.)

In my case, I was cross-correlating between three reference sources:

  • Karl Friston’s works – because Friston & Co. are developing something called active inference, which I think will be the leading AI method (and will substantially outpace reinforcement learning), and I’m using Friston’s work as the basis for my own notation,
  • Matthew Beal’s 2003 Ph.D. dissertation on variational inference – which is the source that Friston & Co. have referenced in their works, and which has notation that Friston (pretty much) follows, and
  • David Blei and colleagues’ work, which is another cornerstone of variational inference.

There was just enough of a difference between their respective notations that I wrote a whole 62-page tutorial for myself, translating between the different notations. I called it a “Rosetta stone” paper because, like the original Rosetta stone, it cross-correlated three different notations: Friston’s, Beal’s, and Blei et al.’s.


This Still Didn’t Save Me

Did you catch my post a few months back, offering a coffee reward to those of you who could find the blooper in the “Variational Bayes” tutorial paper that I wrote? (See the blogpost link below; it’s the June 2nd, 2022 blogpost.)

Well, I made a HUGE blooper.

You know, the kind that is like spilling a full cup of coffee all over yourself just as you enter a major staff meeting? THAT kind.

What happened is that I spent so much time paying attention to how Friston, Beal, and Blei et al. differently used x, y, and z that I didn’t pay attention to how all three of them – collectively – used P and Q differently from EVERYONE ELSE.

So when I wrote up this self-tutorial, my equations (using P and Q, which is what people use these days instead of Kullback and Leibler’s original f(x) and g(x)) were the right equations – but my interpretation of P and Q was reversed. (Because in my head, I was using P and Q in the context that most people do … not in the context of Friston, Beal, and Blei et al.)

I didn’t find this blooper until I put the equations into code, and something that should have made sense … didn’t.

So I had to go back to my original source for the code (my tutorial paper), and from there to the sources for that paper (Friston, etc.), and that’s when I discovered the blooper.

For most of the world, when we use P and Q notation, we use P to represent our data, and Q to represent our model. There are ALL SORTS of papers with that notation – including the venerable Wikipedia. (Not what we like to cite as an authoritative source … but … just saying.)

In the Friston world (and also that of Beal and Blei et al.) P means the model, and Q means the data. In other words, the meanings are reversed.
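If you want to see why this matters in practice, here is a minimal sketch (in Python, with made-up numbers; this is NOT the code from my tutorial paper) of the two reading conventions. Because the Kullback-Leibler divergence is not symmetric, swapping which distribution you treat as “the data” changes the number that you compute:

    import numpy as np

    def kl_divergence(a, b):
        """Discrete K-L divergence: sum over x of a(x) * ln( a(x) / b(x) )."""
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        return float(np.sum(a * np.log(a / b)))

    # Hypothetical distributions over three outcomes (illustration only).
    data_dist  = np.array([0.5, 0.3, 0.2])   # the "data" (empirical frequencies)
    model_dist = np.array([0.4, 0.4, 0.2])   # the "model" (predicted probabilities)

    # The divergence is not symmetric, so the two conventions give different values:
    print(kl_divergence(data_dist, model_dist))   # approx. 0.0253 (weighted by the data)
    print(kl_divergence(model_dist, data_dist))   # approx. 0.0258 (weighted by the model)

So if you code up the right equation but read P and Q the wrong way around, you end up reporting – and reasoning about – the wrong one of those two numbers. That is exactly the kind of thing that “should have made sense … didn’t.”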

I got caught up in the very fine-grained little details, and I missed the bigger-picture details.

I’m sharing this with you because it’s a classic example of getting caught in a Donner Pass white-out blizzard.

It didn’t QUITE kill me. I’m still in the game; still working with the math. But … it COULD have been my undoing – and that of anyone else who followed my trail. (That is, tried to build their own code after reading my paper, instead of going back to my sources.)


Why This Is an Important Morality Story

“Morality stories” are those designed to teach us the value of staying on the straight-and-narrow path of good and righteousness.

“If you can’t be a good example, then you’ll just have to be a horrible warning.”

Catherine Aird, https://www.goodreads.com/quotes/1963-if-you-can-t-be-a-good-example-then-you-ll-just

The point of this little “morality story” is – if I made such a major blooper (and I was trying really, REALLY hard to get things right), then this can happen to almost anyone.

Not you, maybe.

But – those OTHER PEOPLE. Yes, THEY could have a problem.

It’s not the notation that gets us.

It’s the cross-correlation between multiple notations that induces the “white-out mind freeze.”

It’s when we have two or three (or more) different notations, all discussing the same thing, that the trouble really starts.

If we’re going to mature in our field, we can’t always rely on pre-digested information. Books and blogs produced by someone else are always influenced by THEIR interpretation.

Sometimes, we just have to go into the harsh mountains on our own, and get lost, then found, then lost again …

The thing that we need to be very careful with is the notation.


To your health, well-being, and outstanding success!

Alianna J. Maren, Ph.D.

Founder and Chief Scientist, Themesis, Inc.


P.S. – I still have to go back and fix that tutorial paper. When I do, you’ll see a new version on arXiv, with a date around this time – September, 2022.

And … now on my list of “things-to-do” is a paper that I’ve started that is SPECIFIC to detailing out the notation for the Kullback-Leibler divergence, as expressed by different authors. The P’s and the Q’s. The x’s, y’s, and z’s.

Creating meaning in an otherwise semi-chaotic world.



A Useful Book

Not everyone wants to sit down for a good read … with a book on statistical mechanics.

However, if you must … if you really must … I kind-of like the one written by James Sethna.

“Science grows through accretion, but becomes potent through distillation.”

James Sethna, Statistical Mechanics: Entropy, Order Parameters, and Complexity. p. 3, “What Is Statistical Mechanics?”

That’s the quote that got me.

I have a small collection of stat-mech books. (All of them, thankfully, in storage somewhere.) I remember them as being very, VERY heavy. In both a literal and figurative sense.

But somehow (one of the rare times I followed through with an Academia recommendation), I came across Sethna’s book (free PDF). And what struck me was that Sethna was attempting to find those moments of lucid clarity that make a subject such as this an inspirational one, instead of a deadening round of derivations.

So … if you’re truly going to follow through and study a bit of stat-mech on your own, this might truly be a worthwhile read.

Also, none of us HAS to read the whole thing. Little bits and pieces are JUST FINE.

So if you catch me referencing something and want a quick little dive into more detail, this book is a good choice – and I would say, much more readable (on this subject) than our dear old Wiki.

Have fun, darlings! – AJM

Sethna, James. 2006. Statistical Mechanics: Entropy, Order Parameters, and Complexity. Oxford, England: Oxford University Press.



References

NOTE: The references, blogposts, and YouTubes included here will be useful across all blogposts in this Kullback-Leibler / free energy / variational inference series. These references are largely replicated across all the blogposts in the series.

Beal, M. 2003. Variational Algorithms for Approximate Bayesian Inference. Ph.D. Thesis, Gatsby Computational Neuroscience Unit, University College London.

Blei, D.M. 2017. “Variational Inference: Foundations and Applications.” (Presented May 1, 2017, at the Simons Institute.) http://www.cs.columbia.edu/~blei/talks/Blei_VI_tutorial.pdf

Blei, D.M., A. Kucukelbir, and J.D. McAuliffe. 2016. “Variational Inference: A Review for Statisticians.” arXiv:1601.00670v9. doi:10.48550/arXiv.1601.00670. (Accessed June 28, 2022.)

Blei, D.M., A.Y. Ng, and M.I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3: 993-1022. (Accessed June 28, 2022; https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf )

Friston, K., M. Levin, B. Sengupta, and G. Pezzulo. 2015. “Knowing One’s Place: A Free-Energy Approach to Pattern Regulation.” Journal of the Royal Society Interface 12: 20141383. doi:10.1098/rsif.2014.1383.

Friston, K. 2013. “Life As We Know It.” Journal of the Royal Society Interface 10. doi:10.1098/rsif.2013.0475.

Friston, K. 2010. “The Free-Energy Principle: A Unified Brain Theory?” Nature Reviews Neuroscience 11(2): 127-138.

Kullback, S., and R.A. Leibler. 1951. “On Information and Sufficiency.” Annals of Mathematical Statistics 22(1): 79-86. (Accessed June 28, 2022; https://www.researchgate.net/publication/2820405_On_Information_and_Sufficiency/link/00b7d5391f7bb63d30000000/download )

Maren, Alianna J. 2019. “Derivation of the Variational Bayes Equations”  arXiv:1906.08804v4 [cs.NE] (Themesis Technical Report TR-2019-01v4 (ajm).) doi:10.48550/arXiv.1906.08804. (Accessed 2022 Sept. 6; https://arxiv.org/abs/1906.08804 .)


Related Blogposts

Maren, Alianna J. 2022. “The Kullback-Leibler Divergence, Free Energy, and All Things Variational (Part 1 of 3).” Themesis, Inc. Blogpost Series (www.themesis.com). (June 28, 2022) (Accessed Sept. 6, 2022; https://themesis.com/2022/06/28/the-kullback-leibler-divergence-free-energy-and-all-things-variational-part-1-of-3/ )

Maren, Alianna J. 2022. “Major Blooper – Coffee Reward to First Three Finders.” Themesis, Inc. Blogpost Series (www.themesis.com). (June 2, 2022) (Accessed Sept. 6, 2022; https://themesis.com/2022/06/02/major-blooper-coffee-reward/ )

Maren, Alianna J. 2022. “How Backpropagation and (Restricted) Boltzmann Machine Learning Combine in Deep Architectures.” Themesis, Inc. Blogpost Series (www.themesis.com). (January 5, 2022) (Accessed June 28, 2022; https://themesis.com/2022/01/05/how-backpropagation-and-restricted-boltzmann-machine-learning-combine-in-deep-architectures/ )

Maren, Alianna J. 2022. “Entropy in Energy-Based Neural Networks.” Themesis, Inc. Blogpost Series (www.themesis.com). (April 4, 2022) (Accessed Aug. 30, 2022; https://themesis.com/2022/04/04/entropy-in-energy-based-neural-networks-seven-key-papers-part-3-of-3/ )

Maren, Alianna J. 2014. “The Single Most Important Equation for Brain-Computer Information Interfaces.” Alianna J. Maren Blogpost Series (www.aliannajmaren.com). (November 28, 2014) (Accessed Aug. 30, 2022; https://www.aliannajmaren.com/2014/11/28/the-single-most-important-equation-for-brain-computer-information-interfaces/ )


Related YouTubes

Maren, Alianna J. 2022. “The AI Salon: Statistical Mechanics as a Metaphor.” Themesis, Inc. YouTube channel. (Sept. 2, 2022) (Accessed Aug. 30, 2022; https://www.youtube.com/watch?v=D0soPGtBbRg)

Maren, Alianna J. 2022. “Statistical Mechanics of Neural Networks: The Donner Pass of AI.” Themesis, Inc. YouTube channel. (Sept. 15, 2022) (Accessed Aug. 30, 2022; https://www.youtube.com/watch?v=DjKiU3qRr1I)


Books and Syllabi

Feynman, R.P. 1972, 1998. Statistical Mechanics: A Set of Lectures. Reading, MA: Addison-Wesley.

Sethna, James. 2006. Statistical Mechanics: Entropy, Order Parameters, and Complexity. Oxford, England: Oxford University Press. (Accessed Sept. 7, 2022; https://sethna.lassp.cornell.edu/StatMech/EntropyOrderParametersComplexity20.pdf )

I just found this fabulous little syllabus put together by Jared Tumiel last night; it’s spot-on, very well-organized, and has enough in it to keep most of us busy for the next several years. It’s hosted on his GitHub site.

Tumiel, Jared. 2020. “Spinning Up in Active Inference and the Free Energy Principle: A Syllabus for the Curious.” Jared Tumiel’s GitHub Repository. (Oct. 14, 2020.) https://jaredtumiel.github.io/blog/2020/10/14/spinning-up-in-ai.html
