Variational Free Energy and Active Inference: Pt 5

The End of This Story

This blogpost brings us to the end of a five-part series on variational free energy and active inference. Essentially, we’ve focused only on that first part – on variational free energy. Specifically, we’ve been after Karl Friston’s Eqn. 2.7 in his 2013 paper, “Life as We Know It,” and similarly in Friston et al. (2015), “Knowing One’s Place.”

Picking up from where we left off last week, this key equation can be written two ways, as shown in Figure 1.

We start with a Kullback-Leibler divergence, and parse it two different ways, resulting in two very different expressions.

Figure 1. (Originally presented as Figure 8 in last week’s post.) Friston’s Eqn. 2.7 from “Life as We Know It” models a system which contains both external states (Psi) and internal states (r), separated by a Markov blanket (s and a). The Kullback-Leibler divergence presented at the top can be deconstructed in two different ways. Each of the results is a valid equation; one is more computationally useful than the other. For simplicity, the modeling parameter m is not shown in the model probability distribution p(Psi, s, a, r).

(Note: The term at the bottom of the right-hand column – the sum involving only q(Psi | r) – is what Friston expresses as H in his Eqn. 2.7: an entropy term.)

A key step, introduced in the previous post, is that we represent the distribution of external states as conditioned on the internal states: q(Psi | r). More specifically, we say that our distribution over the external states Psi is conditioned on our prior knowledge of the internal states r, creating a posterior distribution. In this expression, our external states Psi are now our latent (hidden, or dependent) variables. The independent variables r are those associated with the internal states – the system (really, the sub-system) enclosed by the Markov blanket.

We also create a model of EVERYTHING (Psi, r, s, and a) using modeling parameters m.

These two distributions are used in Friston’s Eqn. 2.7, shown in Figure 2.

Figure 2. (Introduced earlier as Figure 1 in the first post in this series.) Friston (2013) presents Eqn. 2.7 in his Lemma 2.1, “Free Energy,” in “Life as We Know It.”

We construct a Kullback-Leibler divergence between our distribution q(Psi | r) and our model p(Psi, s, a, r | m), and use this as a springboard.
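In symbols (just a quick sketch of what the figures show diagrammatically, written as a sum over discrete external states), that springboard divergence is:

```latex
D_{KL}\!\left[\, q(\psi \mid r) \,\middle\|\, p(\psi, s, a, r \mid m) \,\right]
  \;=\; \sum_{\psi} q(\psi \mid r)\,
        \ln \frac{q(\psi \mid r)}{p(\psi, s, a, r \mid m)} .
```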

We can deconstruct this Kullback-Leibler divergence two different ways, as shown previously in Figure 1.

However, Eqn. 2.7 (in Figure 2, above) is only half the story. There is a second way in which we can deconstruct that original Kullback-Leibler divergence.

The following Figure 3 shows us BOTH potential deconstructions.

Figure 3. (Originally shown as Figure 6 in Part 2 of this blogpost series.) The etymology of Eqn. 2.7, expressed in fuller form (Friston et al. 2015), and its initial point (Friston, 2013).

We show these two resultant equations in a “close-up view” in the following Figure 4.

Figure 4. The two ways in which we can deconstruct the initial Kullback-Leibler divergence of Figure 2; shown again in equation-form in Figure 3.
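In symbols (again, a sketch consistent with the figures rather than a transcription of them), the two deconstructions come from handling the log-ratio in two different ways. Splitting off the entropy of q gives an “energy minus entropy” form; factoring the model as p(Psi, s, a, r | m) = p(Psi | s, a, r, m) p(s, a, r | m) gives a “divergence from the true posterior, minus log evidence” form:

```latex
D_{KL}\!\left[ q(\psi \mid r) \,\middle\|\, p(\psi, s, a, r \mid m) \right]
  \;=\; \underbrace{-\sum_{\psi} q(\psi \mid r)\,\ln p(\psi, s, a, r \mid m)}_{\text{expected energy}}
        \;-\; \underbrace{\Big(-\sum_{\psi} q(\psi \mid r)\,\ln q(\psi \mid r)\Big)}_{H[\,q(\psi \mid r)\,]}
  \;=\; D_{KL}\!\left[ q(\psi \mid r) \,\middle\|\, p(\psi \mid s, a, r, m) \right] \;-\; \ln p(s, a, r \mid m) .
```

The first form can be evaluated using only q(Psi | r) and the joint model; the second requires the true posterior p(Psi | s, a, r, m), which is exactly the thing we generally cannot compute.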

Only one of these two deconstructions is “computationally useful,” according to Friston. That is the “left-hand path,” shown in Figure 1.

“The solution to equation (3.2) implies the internal states minimize free energy rendering the divergence zero … the active states are complicit in this inference, sampling sensory states that maximize model evidence: in other words, selecting sensations that the system expects. This is active inference, in which internal states and action minimize free energy …”

Friston et al. (2015), “Knowing One’s Place.” See also the discussion from p. 25, Maren (2022), “Derivation of the Variational Bayes Equations,” for comments by Beal and Blei et al.

(Note: See more on this, at the very end of this post.)

The Computational Details

This blogpost restricts itself to the highlights.

The computational details and full interpretation are in my (recently revised) arXiv paper, “Derivation of the Variational Bayes Equations.”

I Promised You an Example

No, I haven’t forgotten.

And yes, I DO have an example.

HOWEVER.

(Or, “However, comma, …” as my mother used to say.)

The example involves a system that can, in and of itself, be brought to free energy equilibrium. It’s a very lovely, very VISUAL example.

But a bit too much to just throw in here as an “add-on.”

If you really, truly, seriously want to see an example, NOW, then check out these two arXiv articles – one written earlier this year, and the other revised just last week.

  • Maren, Alianna J. 2022. “Derivation of the Variational Bayes Equations.” arXiv:1906.08804v5 [cs.NE] 4 Nov 2022. (V5 accessed Nov. 08, 2022; available online at Derivation of the Variational Bayes Equations.) (This is the revised article that has all the P and Q notation cleared up. It derives the Friston equations, with Beal as a primary reference, and a small nod to work by Blei et al.)
  • Maren, Alianna J. 2022. “A Variational Approach to Parameter Estimation for Characterizing 2-D Cluster Variation Method Topographies.” arXiv:2209.04087 [cs.NE] 9 Sept 2022. (Accessed Nov. 08, 2022; available online at A Variational Approach …)

However, if you can wait just a bit, we’ll be discussing these in depth, over the next several weeks. (There are some other topics to wrap up, so there may be a short gap before we go into this – it’s the 2-D cluster variation method, with a means to find the “best fit” parameters.)

Active Inference

Yes, I also promised you something about active inference.

Rather than make this blogpost longer than it already is, let me invite you (once again) to a most excellent review article by Noor Sajid et al.

Here’s an extract that summarizes why I think that active inference will be so useful:

“… in active inference an agent’s interaction with the environment is determined by action sequences that minimize expected free energy (and not the expected value of a reward signal).”

Sajid et al. (2020). “Active Inference: Demystified and Compared.” (p. 5)
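To make that contrast a little more concrete, here is a minimal toy sketch (my own invented numbers, written in Python – not anything taken from Sajid et al.) of an agent choosing between two actions by minimizing one commonly used decomposition of expected free energy, “risk plus ambiguity,” rather than by maximizing an expected reward:

```python
import numpy as np

# Toy one-step "active inference" choice: pick the action that minimizes
# expected free energy, G = risk + ambiguity. All numbers are invented
# purely for illustration.

# Likelihood p(o | s): rows = observations, columns = hidden states.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])

# Predicted hidden-state distribution q(s | action) for two candidate actions.
q_s_given_action = {
    "stay": np.array([0.7, 0.3]),
    "move": np.array([0.2, 0.8]),
}

# Prior preferences over observations, p(o): the "sensations the agent expects."
C = np.array([0.1, 0.9])

def expected_free_energy(q_s, A, C, eps=1e-12):
    """One common decomposition of expected free energy: risk + ambiguity."""
    q_o = A @ q_s  # predicted observation distribution under this action
    # Risk: KL divergence between predicted and preferred observations.
    risk = np.sum(q_o * (np.log(q_o + eps) - np.log(C + eps)))
    # Ambiguity: expected entropy of the likelihood, averaged over q(s).
    ambiguity = -np.sum(q_s * np.sum(A * np.log(A + eps), axis=0))
    return risk + ambiguity

G = {a: expected_free_energy(q_s, A, C) for a, q_s in q_s_given_action.items()}
print(G, "-> chosen action:", min(G, key=G.get))
```

(A pure reward maximizer would, roughly speaking, keep only something like the risk term; the ambiguity term is what gives active inference its uncertainty-reducing, “epistemic” flavor.)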

Why Most People Think This Is Hard

There is almost a community-wide joke about how difficult it is to read Karl Friston’s work.

Yes, his notation is subtle and complex.

And yes, he tends to operate at a high level of abstraction and make some leaps among the mountain crags – leaps that are best done by mountain goats, and not mere mortals.

At the same time, I think that there are really three reasons that most people have not been able to “read” Friston’s papers:

  • Lack of adequate preparation. Reading Friston requires a lot of prior knowledge. Similarly, reading anything by Blei (e.g., his “Variational Inference” tutorial written with Kucukelbir and McAuliffe (2018), or his “Latent Dirichlet Allocation” (LDA) paper, written with Ng and Jordan (2003)) is also hard. ALL of the works in this genre require a subtle and elegant maneuvering back-and-forth between statistical mechanics and Bayesian logic, with a good dose of latent variables thrown in. NONE of this is easy, until one has mastered the basics.
  • Jumping into an Ongoing Conversation. One of the things that really popped out as I re-studied everything (EVERY damn thing), re-read papers, and read predecessor papers, etc., is that NONE of these works are ab initio. They are ALL continuations of lines of thought that began as much as ten or even twenty years prior. This is true not only in the variational world, but also in deep learning. Anyone jumping into Hinton & Salakhutdinov’s (2006) “Reducing the Dimensionality of Data with Neural Networks,” or their next major milestone, Salakhutdinov & Hinton (2012), “An Efficient Learning Procedure for Deep Boltzmann Machines,” has the same problem. Anyone jumping into Steve Grossberg’s works (we’ll be addressing them soon) will have the same problem. These are all conversations that began very early; some just around 2000; some in the 1980s.
  • Notation. It always comes back to notation. I wrote the entire (now revised) arXiv paper because I needed to keep track of notation across three different authors (authoring teams), each of which used subtly different notation. And – as always – notation is a “compact representation” for typically complex, subtle, and abstract notions. We have to wrap our heads around the notions first, then understand exactly how the notation represents them. Not always easy.

Things to Read (in English, instead of Greek)

There are several efforts (by others) to write intuitive (equation-free) interpretations of Friston; formal citations for these will go into the Resource Compendium described below.

In sum: it’s difficult, but not impossible.

Even the “intuitive” explanations take a lot of work.

The real scientific papers take a WHOLE lot of work. And there are subtleties.

How I Got So Very Confused (and Started This Adventure)

This whole thing started several years ago, when I was attempting to read Karl Friston’s papers. This meant, of course, attempting to teach myself variational Bayes. The two best sources were Matthew Beal’s Ph.D. dissertation (2003) (which Friston had referenced), and a tutorial written by David Blei, Alp Kucukelbir, and Jon McAuliffe (first posted in 2016, and revised through 2018).

Naturally, the Blei et al. paper was not a direct lead-in to Friston, as it was published well after the two Friston papers that were central to my studies, “Life as We Know It” (2013) and “Knowing One’s Place” (Friston et al., 2015). However, the Blei et al. paper offered some very useful insights into variational Bayes (which was its purpose), and so there it was – a complement to Beal’s exposition.

That Old Notation Thing

The problem, as always, was with notation.

Here’s a quick (edited & updated) extract from the self-tutorial paper that I completed in 2019:

Note that in this explanation by Blei et al., the observable variable was denoted as x instead of y, but the independent (and hidden) variable was denoted z. Thus, in Beal, the observables are y and the latent are x, and in Blei et al., the observables are x and the latent are z.

Maren, Alianna J. 2019. arXiv:1906.08804v5. (Extract from p. 14)

This led to a constant mental shell-shuffling game, and I simply couldn’t keep up.

So I wrote a tutorial for myself, and called it a “Rosetta stone” translation; the notations from Friston, Beal, and Blei et al. would all get cross-compared and put into a consistent reference frame.

The Rosetta stone: an ancient Egyptian stele (a carved stone slab) bearing a decree issued on behalf of Ptolemy V Epiphanes in 196 BC, inscribed in three scripts: Egyptian hieroglyphic and Demotic (on the top), and Ancient Greek (on the bottom). (Photo 112118547 / Rosetta Stone © Pxlxl | Dreamstime.com)

The reason that I called this a “Rosetta stone” paper was that, up until the time the Rosetta stone was discovered, no one could read Egyptian hieroglyphs. This stele – a carved stone with the same text in three scripts – made it possible to begin translating hieroglyphs, and opened up new studies of ancient Egypt.

So in 2019, I got the paper written – and uploaded to arXiv. Over sixty pages of long formulas, replicating what Beal and Friston had independently done, and cross-referencing their notation against each other.

And this worked, sort of.

The thing is … I got so tied up in tracking the x’s, y’s, and z’s that I lost track of the big one – the P and Q notation. Specifically, Friston and Beal (and later, Blei et al.) reverse the use of P and Q (or p and q) notation from what is more commonly used.

Throughout most of the civilized world, when writing the Kullback-Leibler divergence, most people use P (or p) to represent the data distribution, and Q (or q) to represent the model.

But Friston, Beal, and Blei et al. don’t. They use P to represent the model (P is p(y | theta), where theta is the set of model parameters), and Q to represent the data distribution.
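Restated in the notation from earlier in this post (my own summary, not a quotation from any of those authors), the divergence that gets minimized is

```latex
D_{KL}\!\left[\, Q \,\middle\|\, P \,\right]
  \;=\; D_{KL}\!\left[\, q(\psi \mid r) \,\middle\|\, p(\psi, s, a, r \mid m) \,\right],
```

with q playing the role of the data-fitting (variational) distribution, and p playing the role of the model.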

So there I was, fussing at the details, and having all sorts of fogginess about the big things – the respective roles of P and Q.

In short, I tripped over my own feet – badly.

“Thinking of Things”

It wasn’t until Christmas of 2021 – when I ran code based on my equations, and saw that the code results were simply wrong – that I realized that something was amiss.

So, I traced the code-equations back to my arXiv paper-equations, and from there to the original sources (Friston and Beal), figured out my mistake, re-did the code (it then gave beautiful results!) … and realized that I’d have to re-write the whole arXiv paper.

Meaning, I’d written the whole thing with a certain muddleness-in-the-head, and I had to clear that out.

“When you are a Bear of Very Little Brain, and you Think of Things, you find sometimes that a Thing which seemed very Thingish inside you is quite different when it gets out into the open and has other people looking at it.”

A.A. Milne, Winnie-the-Pooh
Winnie-the-Pooh. Photo 152251281 © Maksym Velishchuk | Dreamstime.com

Why I’m Telling This Story

If I can make this mistake, so can anyone.

There I was – doing my very damn best to unscramble the meaning of things – and missing the forest for the trees. Classic mistake.

But … notation is always the hardest thing.

I’m willing to bet a very good dinner (with wine) that the reason there has not been more work with Karl Friston’s active inference is notational conundrums.

I’m willing to bet that a lot of people have gotten lost in these same mountains – which I’m now calling Donner Pass #2 – and have just quietly drifted off to something that was not causing such a mental blizzard white-out.

Figure 5. Originally presented as Figure 1 of the first blogpost in the predecessor series, Alianna J. Maren (2022), “The Kullback-Leibler Divergence, Free Energy, and All Things Variational (Part 1 of 3).”

The variational free energy work (and along with it, active inference) is the “California Gold Coast” of AI and machine learning.

If we can get through (even some of) Friston’s articles, then we are standing on top of the western edge of the Sierra Nevada mountains of the AI Oregon Trail. We are looking out at the (ever-expanding) terrain that extends towards the ocean.

Not quite down to the coast just yet, but the way is clear.


Trailblazing and Map-Making

Where we are now, in the fifth post in this series, is finally interpreting the key variational free energy equations as used by Friston (op. cit.).

The four posts prior to this, and the five posts in the prior series (Kullback-Leibler, Free Energy, and All Things Variational) are a very long ramp-up.

We can think of those posts as forming a long, carefully-marked trail, with guardrails along the most tricky areas.

The deeper exploration is in the now revised and updated arXiv article.

I thought about going through the final equations here.

But really, those equations need delicate formatting and a lot of attention to detail.

So, instead of trying to put all that subtlety into a blogpost, please consult the arXiv article.


Which Equation Do We Prefer, and Why

The big question, when we look at the free energy equation diagrammatically in Figure 1 (and the same thing, in mathematical form, in Figure 2), is: of the two variants offered, which do we prefer – computationally?

So by now, you may be asking:

What is the difference between full Bayesian inference, variational inference, variational Bayes, variational EM, and stochastic variational inference?

Question posted on Quora.com

There are (of course) a set of really great responses on Quora.com. For a slightly tongue-in-cheek answer, I particularly like the one posted by Jason Eisner. It’s just slightly off-topic, but I do love his quote of David Blei!

“[A2A] Speed is indeed the main reason to use variational methods. David Blei told me long ago, ‘Variational inference is that thing you implement while waiting for your Gibbs sampler to converge.’ :-)”

Answer posted on Quora.com

For a detailed discussion, please see p. 25 of Maren (2022), “Derivation of the Variational Bayes Equations.” This contains extracts of comments by Friston, Beal, and Blei et al.
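As a purely numerical sanity check – a toy sketch with invented numbers, not anything drawn from Friston’s or Beal’s papers – here is the same point in code: the two deconstructions of the divergence always agree, but only the energy-minus-entropy form can be computed without the true posterior.

```python
import numpy as np

# Toy discrete example: three possible external states psi, with a
# variational density q(psi | r). All numbers are invented.
q_psi = np.array([0.5, 0.3, 0.2])      # q(psi | r)

# Joint model p(psi, s, a, r | m), evaluated at the current s, a, r,
# as a function of psi (so it need not sum to one over psi).
p_joint = np.array([0.20, 0.10, 0.05])

p_evidence = p_joint.sum()              # p(s, a, r | m)
p_posterior = p_joint / p_evidence      # p(psi | s, a, r, m)

# Deconstruction 1: expected energy minus entropy (needs only q and the joint).
energy = -np.sum(q_psi * np.log(p_joint))
entropy = -np.sum(q_psi * np.log(q_psi))
F_energy_minus_entropy = energy - entropy

# Deconstruction 2: divergence from the true posterior, minus log evidence.
kl_to_posterior = np.sum(q_psi * np.log(q_psi / p_posterior))
F_divergence_minus_log_evidence = kl_to_posterior - np.log(p_evidence)

print(F_energy_minus_entropy, F_divergence_minus_log_evidence)
assert np.isclose(F_energy_minus_entropy, F_divergence_minus_log_evidence)
```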


How to Stay Informed

This is the fifth (and final) entry in the blogpost series on Variational Free Energy and Active Inference. We’re anticipating weekly posts, and a few YouTubes as well. To be informed as soon as these blogs / YouTubes come out, please do an Opt-In with Themesis.

To do this, go to www.themesis.com/themesis.

(You’re on www.themesis.com right now. You could just hit that “About” button and you’ll be there.)

Scroll down. There’s an Opt-In form. DO THAT.

And then, please, follow through with the “confirmation” email – and then train yourself and your system to OPEN the emails, and CLICK THROUGH. That way, you’ll be current with the latest!

Thank you! – AJM



Resources & References

This Week’s Read-Alongs

These are the primary works specifically identified in today’s post. (And I still need to add in some formal cites for the more “intuitive” blogposts – by others – on the whole Friston ecology.)

  • Beal, M. 2003. Variational Algorithms for Approximate Bayesian Inference, Ph.D. Thesis, Gatsby Computational Neuroscience Unit, University College London. (Accessed Oct. 13, 2022; pdf.)
  • Blei, D.M., A. Kucukelbir, and J.D. McAuliffe. 2018. “Variational Inference: A Review for Statisticians.” arXiv:1601.00670v9 (9 May 2018). doi:10.48550/arXiv.1601.00670. (Accessed June 28, 2022; pdf.)
  • Friston, Karl. 2013. “Life as We Know It.” Journal of the Royal Society Interface 10. doi:10.1098/rsif.2013.0475. (Accessed Oct. 13, 2022; pdf.)
  • Friston, K., M. Levin, B. Sengupta, and G. Pezzulo. 2015. “Knowing One’s Place: A Free-Energy Approach to Pattern Regulation.” Journal of the Royal Society Interface 12: 20141383. doi:10.1098/rsif.2014.1383. (Accessed Oct. 3, 2022; pdf.)
  • Maren, Alianna J. 2022. “Derivation of the Variational Bayes Equations.” arXiv:1906.08804v5 [cs.NE] 4 Nov 2022. (V5 accessed Nov. 08, 2022; available online at Derivation of the Variational Bayes Equations.)
  • Sajid, N., P.J. Ball, T. Parr, and K.J. Friston. 2020. “Active Inference: Demystified and Compared.” arXiv:1909.10863v3 [cs.AI] 30 Oct 2020. (Accessed 17 June 2022; https://arxiv.org/abs/1909.10863.)


The Resources and References list has grown JUST TOO DAMN LONG.

I’m pulling it out, and will create (possibly as an additional post for this week, possibly as next week’s post) a separate Resource Compendium.

I’m thinking of keeping the same Bunny Trails, Blue Squares, and Black Diamonds base structure – and also adding in (as a prelude) a suggested route for self-study.

Any comments? Any thoughts about what YOU’D like to see?

Thank you! – AJM



… And Some Music …

We had a full lunar eclipse last night.

I woke up around 10:30 PM, and put on my sandals (which are appropriate night-time footwear in Hawai’i, even in November), and went out to look at the moon … the shadow just starting … the full eclipse (I looked all over, and didn’t see the moon at all – weird feeling) … then the moon largely emerged from shadow.

Cat Stevens. Cult fave of the 1970s. “Moonshadow.”

What could be more fitting?

Cat Stevens, performing “Moonshadow”

The goddess Lilith is associated with dark moon energy.

“The astrological placement of Lilith represents unrestricted sexual energy, karma, rebellion, chaos, and what lurks in the shadows.”

Lilith: The Dark Moon Goddess. Post on The Crone, July 24, 2020.

#MeToo, anyone?


3 comments

  1. Your 3 reasons are very true. I am guilty of falling into traps #1 and #2. Regarding notation, my bugbear is that math is a notation that is TOO compact. So compact that it obfuscates (maintaining an aura around mathematicians as super-intelligent). Paper and web space is very cheap these days; how about good programming practice of using meaningful variable names? Limit to 20 characters maybe. Use ‘data_probdist’ instead of ‘q’ etc. (Imagine what their Fortran77 programs are like, with variables I, J, K and N, M and P throughout.)
    Even better, turn the pseudocode equations into real code [without calling upon obfuscating Matlab functions in SPM12] and put it in github! Then it will be completely unambiguous.

    Math rant over. I loved the Gibbs sampler joke and will forever think of Markov Blankets in terms of Downton Abbey.

    1. Hi, Neil – well, that makes two of us, doesn’t it? (With regard to reasons #1 & #2: “lack of adequate preparation” and “jumping into an ongoing conversation,” respectively.) And it’s taken me THIS LONG to really come to grips with this, and slow down, and do things more thoroughly and systematically.

      So as a side comment – what we’re talking about here is SO NOT TRIVIAL and SO NOT EASY.

      That’s the thing behind Scott Alexander’s report, citing Peter Freed’s report (in the journal “Neuropsychoanalysis”) of a monthly meeting where very high-powered researchers (he cited their various degrees and total funding) attempted to read a Friston paper … within a month. And when they couldn’t, it was Friston’s fault.

      Well, kind-of, sort-of, right?

      Some of Friston’s language is obscure – realistically, more of a private language that he’s invented over the course of his work. He and his most immediate colleagues speak it fluently; everyone else is scratching their heads.

      But that’s the result of being hugely isolated during the early stages of identifying both the problem and the approach, wouldn’t you agree?

      And this “special language” – I think of it sort of as “twin-speak” – the language that twins might evolve to speak just to each other. Although in this case, Friston was really just trying to communicate to himself, then to a very small set of colleagues – and those of us who are attempting to teach ourselves from the outside … it’s just a longer process, right?

      And w/r/t your points re/ “meaningful variable names” and the overall compactness of math representation – so totally spot-on!

      Except that, as we become more and more intimate and familiar with our topic – immersed in it, so to speak – we can let go of the need for the support provided by longer (more meaningful) variable names.

      So we’re all … I won’t say wrestling with this, but … coming to a deeper level of understanding … sort of a “being-one-with-the-equation.”

      For me, it’s a lot like that Zen practice of “holding the question,” sort of like a “what is the sound of one hand clapping?” type of thing.

      This whole study is just hugely (and annoyingly) like being “before enlightenment”!

      Thanks again for your thoughtful comment! – AJM

