Variational Free Energy and Active Inference: Pt 1

Overarching Story Line

This new blogpost series, on variational free energy and active inference, presents tutorial-level studies centered on the free energy equation (Eq. 2.7) of Karl Friston’s 2013 paper, “Life as We Know It.”

Specifically, we’re focused on the free energy equation shown in Figure 1 below.

CAPTION: Figure 1: Lemma 2.1, presented as Eqn. 2.7, from Karl Friston, 2013, “Life as We Know It.” (See full citation and link in the Resources at the end.)

Over this blogpost series, we will reinforce our studies with Matthew Beal’s 2003 Ph.D. dissertation as a primary resource, because he gives very clear and lucid explanations. Also, Beal (2003) seems to be a primary source of Friston’s (2013) notation; so we can more easily map Friston onto Beal than to any other source.

We will also bring in some comments made by Blei et al. (2018), because they make very useful points about variational inference (variational Bayes). This will introduce some notational confusion, and one of the primary goals of this series will be to untangle the notation across these three sources as much as possible.

Our attention will be restricted to just the basic free energy equation of Friston's (2013) "Life as We Know It." We're going to skip over a lot of meaty components, such as the evolution equations (Fokker-Planck), etc. This means that we have to leave some important concepts – e.g., "the flow in terms of the ergodic density" – untouched. Regrettable, but otherwise, we'd be here for the next several years.

Thus, we’re going to give our attention – over a series of posts – to the following elements:

  • A system with an embedded Markov blanket,
  • The notation used by Friston (and Beal 2003, and – a bit later, also by Blei et al. 2018), and
  • The free energy equation introduced as Lemma 2.1.

Specifically, we’re going to obtain the free energy equation, starting from the Kullback-Leibler divergence.

We will assume, throughout this series, that the reader is not just familiar but fluent with basic thermodynamic (statistical mechanics) notions such as free energy, enthalpy, and entropy, is also familiar with the Kullback-Leibler (K-L) divergence, and is at least somewhat familiar with Bayesian probability methods.

Those desiring a brief refresher are invited to consult the predecessor blogpost series on “Kullback-Leibler, Free Energy, and All Things Variational,” beginning with the first post in that series. There are numerous resources, especially in the last blogpost of that series, that give useful educational support in the statistical thermodynamics needed for this current series.

This blogpost, as with all in this series (and the prior one), has a rich collection of Resources and References at the end. References are typically given in Chicago (author/date) style, with links to the online sources.

(AJM’s Note: We did not get as far with this material as desired; there was a LOT of going down rabbit holes that led to further, deeper, and more branching rabbit-burrows; we are so far underground right now that it is hard to reach the keyboard to write this post. One of the most interesting things to come out of this back-tracing is the realization that the conversations in Friston’s works circa 2012 – 2015 refer back to developmental thinking about a decade earlier, in works published by Friston in 2005, and in contemporary works published by others about 2004 – 2006. There are also some highly relevant citations from 1995. In short, this is not a new line of thought.)


Motivation for This Work

Friston’s works are arguably some of the more difficult threads-of-consciousness to follow. (For a lighthearted, but sharing-that-frustration blogpost on the same, read S. Alexander’s “God Help Us, Let’s Try to Understand Friston on Free Energy.”)

Thus, there needs to be a reason for putting so much energy and attention into reading Friston.

The reason is that active inference (Friston's notion) seems to have very good potential for advanced artificial intelligence. (For a good discourse on active inference, especially as a contrast-and-compare with reinforcement learning, consult Noor Sajid et al.'s "Active Inference: Demystified and Compared;" full citation at the end, in the Blue Squares section. Also, a lot of prior development of active inference, including contrasts with reinforcement learning, is in Friston & Ao, 2012.)

Call it a gut hunch. Call it a sense or intuition that a long-term method for modeling how organisms "organize" their activities, in pursuit of any sort of goal, is best based on free energy minimization – one of the most powerful drivers in the universe.

We believe that Friston’s work advances notions on self-organization, originally introduced by Nicolis & Prigogine (1977), with prior fundamental work by Lars Onsager. (See bio sketches cited in the Resources & References (R&R), end of this post.) These scientists did brilliant work.

There was follow-on work by Walter Freeman and Robert Kozma, which we’re not going to cite here. Then, the introduction of something that looked useful seems to have come with Friston, starting in the early 2000’s.


Backstory

Now, the personal note. More to explain Friston to myself than for any other reason, I wrote a tutorial. Really, it was written for myself – with no expectation of publishing it in a journal, or getting much attention. This was the arXiv paper that I've mentioned a few times, uploaded in 2019. (See cite in R&R section.)

That paper had a misrepresentation of the basic Kullback-Leibler equation – basically, I’d interchanged the meaning of the P and Q notations. So, when it came time to actually USE the equation that I had in the paper – the code produced some truly odd results.
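(A quick aside on why swapping P and Q matters so much: the K-L divergence is not symmetric, so interchanging the two distributions changes the quantity that the code is actually minimizing. Below is a minimal sketch in Python – made-up numbers, purely for illustration, and not the code from that paper – showing the asymmetry.)

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum over x of P(x) * log( P(x) / Q(x) )."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Two made-up discrete distributions over the same three events.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

print("D_KL(P || Q) =", round(kl_divergence(p, q), 3))  # about 1.10 nats
print("D_KL(Q || P) =", round(kl_divergence(q, p), 3))  # about 1.00 nats -- a different number
```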

That caused me to go back to my arXiv paper, and from there, back to the sources (most notably, Friston & Beal), and then I discovered the “bloop.”

I fixed the blooper, the code ran beautifully (as expected), and I had some very satisfying results. I published those results recently, in a new arXiv paper. (Again, see cite in R&R; Blue Squares section, as I’ve endeavored to make it as tutorial as possible.)

So, in the process of backtracking – to fix that "bloop" in my 2019 paper – I've started with Friston (2013) all over again. That also means picking up prior Friston works, as well as Beal's 2003 Ph.D. dissertation, etc.


An Introduction Worth Reading

It is really worthwhile to read the Introduction in Friston’s “Life as We Know It,” and then go back and read (at least the Intros and a few other patches of) the works that he cites.

In particular, we find the following useful:

"Recent formulations try to explain adaptive behaviour in terms of minimizing an upper (free energy) bound on the surprise (negative log-likelihood) of sensory samples [11,12]."

Extract from the last paragraph of Friston (2013), p.1.

Reference [11] in Friston (2013) is to Friston (2012), and reference [12] is to Friston and Ao (2012). We have found it useful to refer to both of these prior sources. (See full citations in the R&R section at the end; under “Black Diamonds.” Almost all Friston works, with the exception of his 2010 paper, are “Black Diamonds” reading.)

Following the rabbit-hole down to a deeper level, we look at a 2004 paper by Knill & Pouget on "The Bayesian Brain." This is about the timeframe at which these notions of variational free energy (variational Bayes) applied to brain processes appear – in Friston's 2005 work and in some contemporaneous papers; see the references in Friston (2012) and Friston & Ao (2012). (We're mentioning this just to get a historical trace on the evolution of these thoughts.)


Glossary: Two Key Terms

Unless the reader is very conversant with classical thermodynamics, statistical mechanics, and an array of related disciplines, a number of the terms used will be foreign.

Thus, a brief glossary. This is part of that slowing way down approach that we’re taking here. Today, we address just two terms:

  • Ergodic: The essential notion of ergodicity is that the way one particle in a system behaves, averaged over a (long) time, essentially replicates the behavior of the whole collection of particles in the system at a given instant. Here's a good (lighthearted!) explanation – sort of long, but worth the read: Taylor Pearson (June 2, 2022), "A Big Little Idea Called Ergodicity (Or, the Ultimate Guide to Russian Roulette)." Taylor gives a really useful illustration of the difference between ensemble probability and time probability. Check this one out, it's very readable – and Taylor also brings in the idea of how an ergodic system is resistant to variations (volatility), etc. (See also the small numerical sketch just after this list.)
  • Surprise: Friston expresses how surprise works in adaptive behavior by saying that [systems engage in] “adaptive behaviour in terms of minimizing an upper (free energy) bound on the surprise (negative log-likelihood) of sensory samples.” For a very light and non-technical attempt to describe surprise, in the context of free energy minimization in a biological system, read an article by Ross Pain, Michael David Kirchhoff, and Stephen Francis Mann, “‘Life Hates Surprises’: Can an Ambitious ‘Free Energy Principle’ Theory Unify Biology, Neuroscience and Psychology?,” in The Conversation (August 16, 2022). Matthew Bernstein (2020a & b) gives a more useful and technical – yet still very readable (“Bunny Trails”) – approach. (See full reference citations at the end of this post.)
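To make the ensemble-versus-time distinction concrete, here is a minimal numerical sketch in Python. This is our own toy example, in the spirit of Pearson's betting illustration, not something taken from Friston or Beal. The multiplicative bet below is deliberately non-ergodic: the ensemble average grows, while any single player's long-run (time-average) outcome collapses toward zero. For an ergodic process, the two averages would agree.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# A multiplicative coin-flip bet, in the spirit of Pearson's illustration:
# wealth is multiplied by 1.5 on heads, by 0.6 on tails.
UP, DOWN = 1.5, 0.6

# Ensemble view: many players, a handful of flips each.
many_players = rng.choice([UP, DOWN], size=(10_000, 10)).prod(axis=1)
ensemble_average = many_players.mean()   # grows, since E[multiplier] = 1.05 per flip

# Time view: ONE player, very many flips.
one_player_wealth = rng.choice([UP, DOWN], size=1_000).prod()   # almost surely collapses

print(f"ensemble average wealth after 10 flips : {ensemble_average:.2f}")
print(f"one player's wealth after 1,000 flips  : {one_player_wealth:.2e}")
# The ensemble average and the single-trajectory (time) outcome disagree,
# so this bet is NOT ergodic. For an ergodic process, the long-time average
# of one trajectory matches the ensemble average -- which is the property
# that lets Friston equate the long-term average of surprise with entropy.
```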

Wrapping It Up: Where Ergodicity and Surprise Fit In

Let’s pull these two notions – ergodicity and surprise – into where we are now, the top of the second page of Friston (2013).

"Under ergodic assumptions, the long-term average of surprise is entropy. This means that minimizing free energy—through selectively sampling sensory input—places an upper bound on the entropy or dispersion of sensory states. This enables biological systems to resist the second law of thermodynamics—or more exactly the fluctuation theorem that applies to open systems far from equilibrium [14,15]."

The first part only, of the first paragraph of Friston (2013), p.2.

What we're talking about so far is a system that will be viewed as two systems – one "internal" and one "external" (see Friston (2013) Abstract and last par. on p.2). These two sets of states are separated by a Markov blanket. (More on that next week.)

The internal states only receive information about the external states via sensory states. (And conversely, the internal states act on the external states only via active states; the sensory and active states define the Markov blanket itself.)
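As a rough sketch of that partition – our own toy illustration in Python, not Friston's equations – we can write out which classes of states each class is allowed to "see." The point to notice is that the internal and external states never reference each other directly; everything passes through the sensory and active states.

```python
# Toy sketch of the Markov-blanket partition: which classes of states
# each class is allowed to "see" when it updates. (Our own illustration;
# Friston's actual flow equations are richer than this.)
dependencies = {
    "external": ["external", "active"],    # the world evolves, nudged only by action
    "sensory":  ["external"],              # sensory states are driven by the world
    "internal": ["internal", "sensory"],   # internal states read ONLY sensory states
    "active":   ["internal", "sensory"],   # active states are set from inside the blanket
}

# The key structural claim: internal and external states never touch directly.
assert "external" not in dependencies["internal"]
assert "internal" not in dependencies["external"]
print("Internal and external states interact only via the blanket (sensory + active states).")
```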

Because the internal states are separate from their surround, we can view them as comprising an ergodic system. That is, they are relatively immune to variations that occur in the external states.

Specifically, Friston opens Section 2 with the statement:

"We start with the following lemma: any ergodic random dynamical system that possesses a Markov blanket will appear to actively maintain its structural and dynamical integrity."

Opening sentence, Sect. 2, of Friston (2013), p.2.

So, we’re looking at a system that is semi-insulated (via a Markov blanket) from its surround. Due to this insulation, it is an ergodic system, and will appear to resist variations due to its external surround. (This ability would allow life to form.)

The adaptation that this system of "internal states" would actually accomplish would be based on sensory inputs ("sensory states") from the surrounding "external states." That is, the internal states would have to adapt – but potentially not in a reactive way. Instead, they would perform active inference.

As Friston and Ao (2012) describe this:

"We have shown recently that adaptive behaviour can be prescribed by prior expectations about sensory inputs, which action tries to fulfill [6]. This is called active inference and can be implemented, in the context of supervised learning, by exposing agents to an environment that enforces desired motion through state-space [7]."

Opening sentence of second paragraph, Introduction, of Friston & Ao (2012).

Thus, we're setting the stage for a process – active inference – by which the internal states (the system enclosed by the Markov blanket) attempt to model the external states, and specifically attempt to form prior expectations – that is, the internal states attempt to predict what will happen in their surround, the external states.

Due to their (partial) insulation from the external states, the internal states can be considered to be an ergodic system.
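To make "prior expectations about sensory inputs, which action tries to fulfill" a little more tangible, here is a deliberately over-simplified sketch of such a loop in Python. This is our own cartoon: a single prediction-error term stands in for the full variational free energy of a proper generative model, so treat it as showing the shape of the loop, not as Friston's formulation.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Cartoon active-inference loop: the agent expects a particular sensory
# reading, and acts on the world so that its sensory input comes to match
# that prior expectation.
prior_expectation = 20.0     # what the agent "expects" to sense
external_state = 5.0         # the hidden state of the world
action_gain = 0.3            # how strongly an action nudges the world

for step in range(30):
    sensory_sample = external_state + rng.normal(scale=0.1)   # sensing (with noise)
    prediction_error = prior_expectation - sensory_sample     # stand-in for "surprise"
    external_state += action_gain * prediction_error          # acting to fulfill the prior

print(f"final external state ≈ {external_state:.2f} (prior expectation was {prior_expectation})")
```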

Very briefly, and following points made by Matthew Bernstein (2020a & b; see citations in the R&R “Bunny Trails” section), there is a “self-information” function I(p) which describes “surprise.” This function is given as:

I(p) := – log(p),

where

  • I(p) is the self-information function (“surprise”), and
  • p is the probability, or likelihood, of a given event.

(AJM's Note: Cross-correlate this with the quotation from Friston in the section "An Introduction Worth Reading," where he says that the "surprise [is the] (negative log-likelihood) of sensory samples." Yup; negative log-likelihood – that's what we've got.)

Then, we identify the information entropy H(X) of a random variable X as

H(X) := E[I(P(X))] = -SUM over x of {P(x) log(P(x))},

where

  • E = the expectation, or long-run average, of the self-information I(P(X)) – where the self-information I is evaluated at the probability P(x) of each specific outcome x of the random variable X.

(See Bernstein, 2020b, for cleaner notation.)

The sum in the above equation is the entropy of the system – the expected, or long-run average, surprise – which is what Friston meant by "Under ergodic assumptions, the long-term average of surprise is entropy."
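Here is a small numerical check of that statement – our own sketch in Python, with a made-up four-outcome distribution: the long-run average of the surprise, taken over samples drawn from P, converges to the entropy H(X).

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# A made-up distribution P over four outcomes.
p = np.array([0.50, 0.25, 0.15, 0.10])

# Self-information ("surprise") of each outcome: I(p) = -log(p).
surprise = -np.log(p)

# Entropy as defined above: H(X) = sum over x of P(x) * I(P(x)).
entropy = float(np.sum(p * surprise))

# Long-run average of surprise over many samples drawn from P.
samples = rng.choice(len(p), size=200_000, p=p)
long_run_average_surprise = float(surprise[samples].mean())

print(f"entropy H(X)                 = {entropy:.4f} nats")
print(f"long-run average of surprise = {long_run_average_surprise:.4f} nats")
# The two agree (up to sampling noise): the long-term average of surprise is the entropy.
```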

This is a bit long-winded as an introduction, but for those unfamiliar with ergodic theory or information theory (particularly the connection between self-information and entropy), it's a useful start.


Beal and Friston on Markov Blankets

One of the things that differentiates Friston’s work from other treatments of variational inference is that he explicitly puts the variational methods in the context of a system where there is a delineation between an external system (“Psi“) and a representational system (“r“).

Note on Notation: If we were to stay with Friston’s notation, there would be tildes (“~“) above each of these terms.

(AJM’s Note: This conversation will be continued in the next blogpost in this series. We’ve got a publication deadline to hit, so it’s time to hit the “publish” button.)


How to Stay Informed

This is the first in a new blogpost series. We’re anticipating weekly posts, and a few YouTubes as well. To get the word, please do an Opt-In with Themesis.

To do this, go to www.themesis.com/themesis.

(You’re on www.themesis.com right now. You could just hit that “About” button and you’ll be there.)

Scroll down.

There’s an Opt-In form.

DO THAT.

CAPTION: Find the Themesis Opt-In form at: http://www.themesis.com/themesis/

And then, please, follow through with the “confirmation” email – and then train yourself and your system to OPEN the emails, and CLICK THROUGH. That way, you’ll be current with the latest!

Thank you! – AJM



Resources & References

Following the protocol that we introduced in last week’s blogpost, we are now grouping resources & references according to difficulty level – from Bunny Trails to Blue Squares to Double Black Diamond.

Any variational anything (variational autoencoders, variational Bayes, variational inference), and any interpretation of active inference, is automatically AT LEAST a blue square, and the most useful sources are often a double black diamond.

Almost all of Friston’s works are double black. Love the man. Think he’s genius. But that “double-black” rating is just what is so.

I’m putting a couple of early blogposts about Friston in at the Bunny Trails level – these are (typically) safe reads.


Bunny Trails – Decent Introductory Source Stuff

CAPTION: The Bunny. We decided (see Kullback-Leibler 2.5/3, or "Black Diamonds") that we needed trail markings for the resource materials.

Bunny slopes: the introductory materials. Some are web-based blogposts or tutorials – good starter sets. All Themesis blogposts and YouTubes are in this category.


Early Work on Self-Organization

AJM’s Note: What follows are some very good biographical sketches of scientists mentioned earlier (and sometimes cited later in the “Black Diamond” section). These are readable and give good insights into their thoughts as they developed their new insights and theories.

Ilya Prigogine

  • Kondepudi, Dilip, Tomio Petrosky, and John A. Pojman. 2017. "Dissipative Structures and Irreversibility in Nature: Celebrating 100th Birth Anniversary of Ilya Prigogine (1917–2003)." Chaos: An Interdisciplinary Journal of Nonlinear Science 27, 104501. doi:10.1063/1.5008858. (Accessed Oct. 10, 2022; https://aip.scitation.org/doi/10.1063/1.5008858 )

For a bit about Lars Onsager, read:


The Notion of “Ergodicity”

AJM’s Note: A very accessible introduction to the notion of ergodicity, which is from the realm of thermodynamics / statistical mechanics, is from Taylor Pearson. Interestingly, Taylor works in the financial space – not physics or physical chemistry, and not AI! Still, the best and most gentle-slope intro that we’ve found to ergodicity.

  • Pearson, Taylor. 2022. “A Big Little Idea Called Ergodicity (Or, The Ultimate Guide to Russian Roulette).” Taylor Pearson’s Blogpost Series (www.taylorpearson.me). (June 2, 2022). (Accessed Oct. 16, 2022; https://taylorpearson.me/ergodicity/)

The Notion of “Surprise”

AJM’s Note: For someone who is not an information-theory person, this is an excellent and lucid two-part tutorial on the notion of “surprise.”

  • Bernstein, Matthew N. 2020a. “What Is Information? (Foundations of Information Theory: Part 1)” Matthew Bernstein GitHub Blog Series (June 13, 2020). (Accessed Oct. 11, 2022; https://mbernste.github.io/posts/self_info/ )
  • Bernstein, Matthew N. 2020b. “Information Entropy (Foundations of Information Theory: Part 2)” Matthew Bernstein GitHub Blog Series (August 07, 2020). (Accessed Oct. 11, 2022; https://mbernste.github.io/posts/entropy/ )

AJM's Note: I like the following article for its attempt to explain Friston's notion of "surprise" in simple terms.

  • Pain, Ross, Michael David Kirchhoff, and Stephen Francis Mann. 2022. "'Life Hates Surprises': Can an Ambitious 'Free Energy Principle' Theory Unify Biology, Neuroscience and Psychology?" The Conversation (August 16, 2022).


Related Themesis Blogposts

The Kullback-Leibler/Free Energy/Variational Inference Series; just the most recent two, plus the kick-off post for the entire thing:

Prior Blogpost on the Kullback-Leibler Divergence:

Older posts on Friston:

The following (2016) blogpost is useful mostly because it has some links to good tutorial references:


Related Themesis YouTubes


Blue Squares – Intermediate Reading/Viewing

CAPTION: Intermediate Reading/Viewing: Requires preliminary knowledge of both concepts and notation. Not trivially easy, but accessible – often advanced tutorials.

Matthew Beal and David Blei

AJM’s Note: Karl Friston’s 2013 “Life as We Know It” referenced Beal’s 2003 dissertation. Friston’s notation is largely based on Beal’s. Friston introduces the notion of Markov blankets as a key theme in discussing how life (or biological self-organization) necessarily emerges from a “random dynamical system that possesses a Markov blanket.” Beal’s Section 1 discusses both Bayesian probabilities as well as Markov blankets. Reading Beal’s work is a very useful prerequisite for getting into anything by Friston. It helps that Beal does his best to present material in a tutorial style. We’ll start with Markov blankets in the next post.

  • Beal, M. 2003. Variational Algorithms for Approximate Bayesian Inference, Ph.D. Thesis, Gatsby Computational Neuroscience Unit, University College London. (Accessed Oct. 13, 2022; pdf.)

AJM’s Note: I refer to the Blei et al. tutorial because it is very solid and lucid. If we’re trying to understand variational ANYTHING (variational Bayes, variational inference, variational autoencoders, etc.); Blei et al. make some comments and offer a perspective that is very complementary to that given by Beal in his 2003 dissertation.

  • Blei, D.M., A. Kucukelbir, and J.D. McAuliffe. 2018. "Variational Inference: A Review for Statisticians." arXiv:1601.00670v9. doi:10.48550/arXiv.1601.00670. (Accessed June 28, 2022; pdf.)

Karl Friston & Colleagues

AJM’s Note: ANYTHING written by Friston is “double black diamond.” That said, a few papers are a BIT more accessible than others.

AJM’s Note: Friston’s 2010 paper is largely conceptual. (Meaning, equations-free.) Not to be blown off; he’s establishing the context for future works.

  • Friston, K. 2010. "The Free-Energy Principle: A Unified Brain Theory?" Nature Reviews Neuroscience 11 (2): 127-138. (Accessed Oct. 13, 2022; online access.)

A More Advanced Treatment of Ergodicity


Kullback & Leibler – Orig. Work

AJM’s Note: Kullback and Leibler. Their original paper. The one that started all of this.

AJM Papers

AJM’s Note: My arXiv paper – the one that needs to be revised. The one that still has the P and Q notation scrambled. (This will be fixed. Soon.)

AJM’s Note: The recent one – using somewhat of a variational method.

  • Maren, Alianna J. 2022. "A Variational Approach to Parameter Estimation for Characterizing 2-D Cluster Variation Method Topographies." arXiv:2209.04087v1 [cs.NE]. (Themesis Technical Report THM TR2022-001 (ajm).) doi:10.48550/arXiv.2209.04087. (https://arxiv.org/abs/2209.04087)

Books – Especially Classics / Good-to-Read

AJM’s Introductory Note: Some of you want to do a systematic, strong study of fundamentals. These are books that are frequently-cited, and people actually DO read them.

AJM's Note re/ Feynman: Feynman is noted for his exceptionally lucid presentations. High school students have been known to read his books. (Brilliant high school students, of course – and looking for status points when video games aren't enough.)

  • Feynman, R.P. 1972, 1998. Statistical Mechanics: A Set of Lectures. Reading, MA: Addison-Wesley; Amazon book listing.

AJM's Note re/ Sethna: What put this book on my list is a comment by Sethna on p. 3 of his book: "Science grows through accretion, but becomes potent through distillation." What a great way to express our understanding of science! (See the end of Kullback-Leibler – Part 1.5 of 3 for my original reference to him.)


Double Black Diamond – Expert-Only Materials

CAPTION: Double Black Diamond: Expert-only! These books, tutorials, blogposts, and vids are best read and watched AFTER you've spent a solid time mastering fundamentals. Otherwise, a good way to not only feel lost, but hugely insecure.

Friston & Co.

AJM's Note: This Friston (2005) paper is his most-cited paper for his personal genesis of active inference, and seems to be the earliest in which he presents a fully-fleshed notion of how "both inference and learning rest on minimizing the brain's free energy, as defined in statistical physics." He refers also to a Hinton et al. (1995) paper, but several papers published between 2004 and 2006 establish the genesis timeframe for Bayesian interpretations of perception.

AJM’s Note: This paper by Knill & Pouget (2004) was published just prior to Friston’s 2005 paper; both dealing with Bayesian modeling of brain processes. Friston cites this in his 2012 works.

AJM’s Notes: These two Friston papers are useful and important predecessors to Friston (2013). These two, in turn, also cite useful and important predecessor works – by both Friston and colleagues as well as others. (See above.) It’s still TBD as to how deep we need to go in reading back into the earliest works, in order to understand the ones addressed in this (blogpost) course of study.

Active Inference: perhaps the most accessible presentation, by Noor Sajid & colleagues (first recommended in the Themesis June 2022 blogpost Major Blooper – Coffee Reward):

  • Sajid, N., Philip J. Ball, Thomas Parr, and Karl J. Friston. 2020. “Active Inference: Demystified and Compared.” arXiv:1909.10863v3 [cs.AI] 30 Oct 2020. (Accessed 17 June 2022; https://arxiv.org/abs/1909.10863 )

AJM's Note: Friston's 2013 paper is the central point for theoretical (and mathematical) development of his notions on free energy in the brain, and in any living system. He starts with the notion of a system separated by a Markov blanket from its external environment, and moves on from there. The forthcoming blogpost series will focus on this paper.

  • Friston, Karl. 2013. “Life as We Know It.” Journal of The Royal Society Interface. 10. doi:10.1098/rsif.2013.0475. (Accessed Oct. 13, 2022; pdf.)

AJM’s Note: Friston and colleagues, in their 2015 paper “Knowing One’s Place,” show how self-assembly (or self-organization) can arise out of variational free energy minimization. Very interesting read!

  • Friston, K., M. Levin, B. Sengupta, and G. Pezzulo. 2015. "Knowing One's Place: A Free-Energy Approach to Pattern Regulation." J. R. Soc. Interface 12: 20141383. doi:10.1098/rsif.2014.1383. (Accessed Oct. 3, 2022; pdf.)

… And Others …

  • Nicolis, Gregoire, and Ilya Prigogine. 1977. Self-Organization in Nonequilibrium Systems: From Dissipative Structures to Order through Fluctuations. New York: Wiley.

Possible follow-ups:

  • Palm, G. 1981. "Evidence, Information, and Surprise." Biological Cybernetics 42 (1 Nov. 1981). doi:10.1007/BF00335160.


… And Music to Go with This Week’s Theme …

It’s got to be love that drives us into this substantive study. Can’t think of any other reason.

The new theme song is: Let’s Talk about Love, written by Bryan Adams, Eliot Kennedy and Jean-Jacques Goldman, and released by Celine Dion in 1997.


