Entropy in Energy-Based Neural Networks – Seven Key Papers (Part 3 of 3)

If you’ve looked at some classic papers on energy-based neural networks (e.g., those introducing the Hopfield neural network, the Boltzmann machine, the restricted Boltzmann machine, and the deep architectures built from them), you’ll see that they don’t use the word “entropy.”

At the same time, we’ve stated that entropy is a fundamental concept in these energy-based neural networks.

So, what’s going on?

How can such a fundamental and important notion be simply not mentioned in the most significant papers in this area?

In this post, we examine how the notion of entropy shows up in energy-based neural networks. Typically, it is addressed in terms of “Gibbs sampling” or “Markov chains.” That is, the author(s) address the notion of entropy implicitly – and put their attention on the pragmatics of introducing entropy into their neural network training.

We will find that the notion of entropy comes about indirectly. That is, the essential concept underlying entropy – giving consideration to the distribution of units over all possible states – is very much a part of energy-based neural networks. However, the term “entropy” itself is rarely used. Instead, we see discussions of how entropy is represented in a pragmatic way, and these discussions focus on Gibbs sampling and Markov chains.

Alianna J. Maren

Figure 1. We examine the role of entropy in energy-based neural networks, and specifically how the notion of entropy is pragmatically addressed in terms of Gibbs sampling and Markov chains. Part 3 in a series of three blogposts on “Seven Key Papers in Energy-Based Neural Networks.” See References for citations and links to the previous two blogposts in the series.

This post is the last of three in our series on “Seven Key Papers on Energy-Based Neural Networks,” and it focuses on the role of entropy in these networks. The first post introduced the notion of statistical mechanics as foundational to (energy-based) neural networks, and discussed the Little-Hopfield neural network. The second post showed how the limitations of the Little-Hopfield neural network were overcome in the Boltzmann machine, through introducing latent variables. (Full blogpost citations are given in the References section at the end of this post. All references are formatted in Chicago Author-Date style.)


A Little Background and Context

We’ve previously introduced the notion of the “Donner Pass of AI,” which is where several disciplines meet up, as shown in the following Figure 2.

Figure 2. The “Donner Pass of AI” is where several disciplines join up to form a confluence of ideas that together are foundational for energy-based artificial intelligence. The key discipline is statistical mechanics. However, Bayesian probability theory, together with various sampling methods (e.g., Gibbs sampling, Markov chains) also contribute.

The essential thing to understand is that energy-based AI uses statistical mechanics as a framework. Specifically, it relies on the Ising model of free energy, which is a very famous (and also the simplest) free energy model from statistical mechanics.

Free energy is a combination of two terms: an energy (or enthalpy) term and an entropy term, with the entropy entering with a negative sign.

Taking an excerpt from a previous blogpost:

The reduced free energy equation is shown below, where all terms (G, H, and S) have been divided through by temperature, T, and by Nk, the product of the total number of units in the system (N) and Boltzmann’s constant (k). The “bar” over certain terms means that they are “reduced” by the division just described.

Our free energy equation looks like

\bar{G} = G/(NkT) = \bar{H} - \bar{S}.

This tells us that our (reduced) free energy is a combination of two terms. The first is the (reduced) enthalpy, which is the energy contributions both from units being in their various activation states, as well as from their interactions with each other. The second is the entropy.

(Excerpt from Maren 2018, “Wrapping Our Heads around Entropy.” Full citation in the References.)
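To make the entropy term a bit more concrete, here is a minimal worked form for the simple case in which each unit can be in one of just two states, with x denoting the fraction of units in the “on” state. (This is the standard two-state mixing form, with Boltzmann’s constant absorbed by the reduction; the exact form of the reduced enthalpy depends on the interaction model, so it is left generic here.)

\bar{S}(x) = -\left[ x \ln x + (1 - x) \ln (1 - x) \right], \qquad \bar{G}(x) = \bar{H}(x) - \bar{S}(x).

If the enthalpy term were zero, minimizing \bar{G} would amount simply to maximizing \bar{S}; Figure 3 below shows what the corresponding negative-entropy term looks like.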

In a previous YouTube video (see below) we discussed how, in energy-based AI, we replace the massless and volumeless point particles of the Ising model with neural network nodes. The key correspondence is that, in both the statistical mechanics model and in neural networks, these particles/nodes can be in one of two possible “energy states.”

In the Ising model, the energy of the system (the enthalpy, if you’re a physical chemist) can be computed in any one of several ways, but they all ascribe the same simple (approximate) interaction energy to every particle in the “on” energy state.

In neural networks, the key (and radical!) notion is that there is not a single “one-size-fits-all” interaction energy value for each “on” node. Instead, the interaction energies are computed, individually and discretely, for each possible node-pair. (This is the case in the Hopfield neural network and the simple Boltzmann machine; it is a little more restricted for the restricted Boltzmann machine.) See the following YouTube for a description of the difference between these two approaches.
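To make that contrast concrete, here is a minimal sketch (the states and the tiny weight matrix are made up purely for illustration, and are not drawn from any of the papers) comparing a single shared Ising-style coupling with a Hopfield/Boltzmann-style energy that uses a separate weight for every node pair:

import numpy as np

# Node/particle states: +1 = "on", -1 = "off" (bipolar convention; values made up).
s = np.array([1, -1, 1, 1])

# Ising-style picture: one shared coupling J for every interacting pair.
J = 0.5
E_ising = -J * sum(s[i] * s[j] for i in range(len(s)) for j in range(i + 1, len(s)))

# Hopfield / Boltzmann-machine picture: a separately learned weight w_ij for each pair
# (symmetric, zero diagonal; these numbers are placeholders, not trained weights).
W = np.array([[ 0.0,  0.2, -0.4,  0.1],
              [ 0.2,  0.0,  0.3, -0.5],
              [-0.4,  0.3,  0.0,  0.6],
              [ 0.1, -0.5,  0.6,  0.0]])
E_hopfield = -0.5 * s @ W @ s   # -(1/2) * sum over i,j of w_ij * s_i * s_j

print(E_ising, E_hopfield)

The only conceptual change is that the single number J becomes a whole matrix of pairwise weights, and those weights are what the network learns.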

In short, previous Themesis YouTubes (the one above, and others in that Playlist) have focused on the energy (enthalpy) term.

At no point, prior to now, have we discussed how the entropy term shows up in the neural networks.

Also, if we look at the three remaining papers in our series of “Seven Key Papers,” we find that they don’t mention entropy, either.

What they DO mention are two ways of establishing a (training) data set:

  • Gibbs Sampling, and
  • Markov chains.

There is also a good deal of discussion about Bayesian probabilities.

The key insight is this: In statistical mechanics, the entropy term ensures that we give consideration to all possible distributions of nodes (particles) over all possible energy states. When we find the free energy minimum (the fundamental process in statistical mechanics), we typically find that the location of this minimum is strongly influenced by the bowl-shaped negEntropy (negative entropy) term.

The following Figure 3 shows the negative entropy for a system that has particles in just two energy states.

Figure 3. The negEntropy (negative entropy) for a system composed of particles that can be in one of two possible energy states. The negEntropy is bowl-shaped, and – if the energy of the system were zero – the center of this bowl would give us the location of the free energy minimum for the system. Figure taken from
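As a quick numerical companion to Figure 3, here is a small sketch (an illustration of the same idea, not the code behind that figure; the grid of x values is arbitrary) that evaluates the negEntropy of a two-state system over the fraction x of units in the “on” state, and confirms that the bottom of the bowl sits at x = 0.5:

import numpy as np

x = np.linspace(0.01, 0.99, 199)       # fraction of units in the "on" state
neg_entropy = x * np.log(x) + (1.0 - x) * np.log(1.0 - x)   # negEntropy = -S(x), with k = 1

# The bowl bottoms out where the two states are equally occupied; with a zero
# energy term, that is also where the free energy minimum would sit.
print(x[np.argmin(neg_entropy)])       # approximately 0.5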

In contrast, when we take the statistical mechanics notions over to energy-based neural networks, we don’t have a simple equation to give us the “distribution of nodes over all possible states.” Just as we had to create an algorithm that would find values for each individual connection weight, we now need to find each element of the “distribution.” This means that we need to assemble a training (and testing) data set.

You’ll see, from each of these three papers (shown in Figure 1, and also given in the References), that a great deal of attention is paid to creating the training and testing data set. The term “entropy” may not be mentioned explicitly. In the context of these papers, which are written by researchers steeped in statistical physics FOR readers with similar backgrounds, the authors assume that the readers already know about the importance of entropy. They assume that this concept is so familiar, and so fundamental, that they don’t even need to discuss it. Instead, they focus on the pragmatics of creating that all-important “distribution over all possible states.”

To do this, they turn to well-known methods: Gibbs sampling and Markov chains.
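To give a flavor of what Gibbs sampling looks like in this setting, here is a minimal sketch of block Gibbs sampling in a small restricted Boltzmann machine. The layer sizes, weights, and biases are invented placeholders; the part to focus on is the alternating conditional updates (using the standard logistic form for binary units), which define the Markov chain.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical RBM: 6 visible units, 3 hidden units; parameters are placeholders.
W = rng.normal(0.0, 0.1, size=(6, 3))   # one weight per visible-hidden pair
b = np.zeros(6)                         # visible biases
c = np.zeros(3)                         # hidden biases

def gibbs_step(v):
    """One sweep of block Gibbs sampling: hidden given visible, then visible given hidden."""
    p_h = sigmoid(c + v @ W)                        # P(h_j = 1 | v), standard logistic form
    h = (rng.random(3) < p_h).astype(float)
    p_v = sigmoid(b + h @ W.T)                      # P(v_i = 1 | h)
    v_new = (rng.random(6) < p_v).astype(float)
    return v_new, h

v = rng.integers(0, 2, size=6).astype(float)        # random binary starting state
for _ in range(100):                                # let the Markov chain run for a while
    v, h = gibbs_step(v)
print(v)

Run long enough, the chain visits states in proportion to their Boltzmann probabilities, which is precisely how the “distribution over all possible states” gets represented in practice.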

We’ll discuss these methods in later posts. Before we do, you may find it interesting (and useful) to briefly look at Hinton’s 2002 paper, in which he describes training a “product of experts” by minimizing contrastive divergence. This is actually a very useful read, in that it gives you insight into how Hinton was thinking as he evolved the notions of training a (restricted) Boltzmann machine.
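If you do dip into the 2002 paper, its heart is the contrastive divergence update. The sketch below shows roughly what a CD-1 update looks like for a binary restricted Boltzmann machine; the function and parameter names are hypothetical, and this is an illustration of the general idea rather than the paper’s exact recipe.

import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v_data, W, b, c, lr=0.05):
    """One CD-1 parameter update for a binary RBM (illustrative sketch only)."""
    p_h_data = sigmoid(c + v_data @ W)                   # hidden probabilities, data ("positive") phase
    h_sample = (rng.random(len(c)) < p_h_data).astype(float)
    p_v_recon = sigmoid(b + h_sample @ W.T)              # one-step reconstruction of the visible units
    v_recon = (rng.random(len(b)) < p_v_recon).astype(float)
    p_h_recon = sigmoid(c + v_recon @ W)                 # hidden probabilities, reconstruction ("negative") phase

    # Gradient estimate: <v h> under the data minus <v h> under the one-step reconstruction.
    W = W + lr * (np.outer(v_data, p_h_data) - np.outer(v_recon, p_h_recon))
    b = b + lr * (v_data - v_recon)
    c = c + lr * (p_h_data - p_h_recon)
    return W, b, c

Repeated over many data vectors, updates of this general kind are what the later papers use to train their restricted Boltzmann machines, one Gibbs step at a time.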


How Authors of These Papers Explain Entropy (or Not)

The three papers that we’re using here are authored either by Geoffrey Hinton (sole author) or by Hinton and Ruslan Salakhutdinov. (See References for each of these papers, with links.) These papers all discuss – at their core – energy-based neural networks.

If you’d rather watch a vid than read a paper, the series of videos presented by Geoffrey Hinton is a good starting place. Here’s an early one in the series, where he discusses the Hopfield neural network. (Hinton, Geoffrey E. 2016. “Lecture 11A : Hopfield Nets.” See full citation in References.)

This blogpost will be continued, as we take a deeper look into the mechanisms by which the data sets are created, and the guiding thoughts taken from Bayesian probabilities, Gibbs sampling, and Markov chains.

To your health and outstanding success!

Alianna J. Maren, Ph.D.

Founder and Chief Scientist, Themesis, Inc.


References

Salakhutdinov, Ruslan, and Geoffrey Hinton. 2012. “An Efficient Learning Procedure for Deep Boltzmann Machines.” Neural Computation 24(8) (August, 2012): 1967–2006. doi: 10.1162/NECO_a_00311. (Accessed April 3, 2022.) https://www.cs.cmu.edu/~rsalakhu/papers/neco_DBM.pdf

Hinton, Geoffrey E., and Ruslan Salakhutdinov. 2006. “Reducing the Dimensionality of Data with Neural Networks.” Science 313(5786) (July 28, 2006): 504–507. doi: 10.1126/science.1127647. (Accessed April 5, 2022.) https://www.cs.toronto.edu/~hinton/science.pdf

Hinton, Geoffrey E. 2002. “Training Products of Experts by Minimizing Contrastive Divergence.” Neural Computation 14(8) (August, 2002): 1771–1800. doi: 10.1162/089976602760128018. (Accessed April 3, 2022.) https://www.cs.toronto.edu/~hinton/absps/tr00-004.pdf


Previous Related Blogs

Maren, Alianna J. 2021. “Latent Variables Enabled Effective Energy-Based Neural Networks: Seven Key Papers (Part 2 of 3).” Themesis Blogpost Series (November 16, 2021). (Accessed April 3, 2022.) https://themesis.com/2021/11/16/latent-variables-enabled-effective-energy-based-neural-networks-seven-key-papers-part-2-of-3/

Maren, Alianna J. 2021. “Seven Key Papers for Energy-Based Neural Networks and Deep Learning (Part 1 of 3).” Themesis Blogpost Series (November 5, 2021). (Accessed April 3, 2022.) https://themesis.com/2021/11/05/seven-key-papers-part-1-of-3/

Maren, Alianna J. 2018. “What We Really Need to Know about Entropy.” Alianna J. Maren blogpost series. (Feb. 28, 2018). (Accessed Mar. 28, 2022.) http://www.aliannajmaren.com/2018/02/28/what-we-really-need-to-know-about-entropy/

Maren, Alianna J. 2018. “Wrapping Our Heads around Entropy.” Alianna J. Maren blogpost series. (Feb. 13, 2018). (Accessed Mar. 28, 2022.) http://www.aliannajmaren.com/2018/02/13/wrapping-our-heads-around-entropy/


Previous Related YouTubes

Maren, A.J. 2021. “Statistical Mechanics: Foundational to Artificial Intelligence.” Themesis YouTube Channel. (Aug. 26, 2021.) (Accessed March 27, 2022.) https://www.youtube.com/watch?v=y9ajEXdx54Q&t=2s

Maren, A.J. 2021. “Statistical Physics Underlying Energy-Based Neural Networks.” Themesis YouTube Channel. (Aug 26, 2021.) (Accessed April 3, 2022.) https://www.youtube.com/watch?v=ZazyMS-IDg8&t=163s

Hinton, Geoffrey E. 2016. “Lecture 11A: Hopfield Nets.” YouTube: Neural Networks for Machine Learning by Geoffrey Hinton [Coursera 2013]. (Dec. 19, 2016.) (Accessed April 3, 2022.) https://www.youtube.com/watch?v=cLuuAjvawhQ


Good Vibes

If I want to connect in with the time of Cleopatra (see below), thanks to the offerings of YouTube, I can listen to some recreations (in spirit if not exactitude) of music of that time: “Ancient Egyptian Music – Cleopatra.” Fantasy & World Music by the Fiechters.


A longer (3-hr) soundtrack is offered by Lantern.

Famous Salonnières, High Priestesses, Professors, and Pharaohs!

Cleopatra (Cleopatra VII Philopator), Queen of Egypt from 51 to 30 BC, is a little different from those mentioned previously in our “Salonnière” Series. To the best of our knowledge, she never hosted a salon.

Cleopatra was the last in the Ptolemaic dynasty, founded in 305 BC by Ptolemy I Soter, a Macedonian who was one of Alexander the Great’s top generals. When Alexander died, Ptolemy took control of Egypt and founded a dynasty that ruled for over 200 years. Cleopatra was his last descendant to rule Egypt before it was conquered by the Romans.

Cleopatra was highly intelligent. She spoke several languages, and was well-versed, most of all, in politics and strategies for survival. Her life was tumultuous.

However, had she lived in calmer, more benign times, it is likely that she would have carried on in the Ptolemaic dynastic tradition of supporting the arts and sciences. Her dynastic founder, Ptolemy I Soter, also founded the Museum of Alexandria (the Greek Mouseion, or “Seat of the Muses”) in 283 BC. This Museum (which continued until the time of Hypatia) became the home of the Royal Library of Alexandria.

The Ptolemies brought in scientists and researchers from all over the world, and the Museum was the primary center of learning in its time. However, in later years, her ancestor Ptolemy VIII Physcon actually purged the Library of intellectuals. In short, her dynasty initially made a point of fostering artistic and intellectual creativity as well as high scholarship, and then went through a period of actively diminishing the role of intellectual work. By the time that Cleopatra reigned, over a hundred years of political intrigue and warfare had shifted the focus of the Ptolemaic rulers. Nevertheless, the Library – although damaged by a fire set by Julius Caesar’s ships – remained one of the leading intellectual centers of the world at that time. (See this account of the Library.)

The intellectual hubs – the salons of their day – moved, however. The great thinkers left Alexandria, due to the purges by Ptolemy VIII Physcon, and found refuge in other cities, such as Athens and Rhodes.

The lesson for us here is: follow the great minds to wherever they are. An unstable political environment can dramatically affect the venue of intellectual hubs.

