One of the most important things for us to understand, as we come into the “deep learning” aspect of AI for the first time, is the relationship between backpropagation and (restricted) Boltzmann machines, which comprise the essential core of various “deep learning” architectures.
The essential idea in deep architectures is this:
- Each adjacent pair of layers in a “deep” architecture is typically trained using a generative method, such as contrastive divergence. (A minimal code sketch of this pairwise training follows this list.) Each pair of layers is really an autoencoder, where the “top layer” comprises the latent variables, or features, within the patterns presented for training. The patterns themselves are the “bottom layer,” or the known (observed) variables. The notion of “top” and “bottom” layers is strictly subjective; it’s an artifice that we use later to structure these two sets (the latent and the observed variables) into a layered architecture.
- Once the basic connection weights have been found, they can be “refined” using backpropagation, on a layer-by-layer basis.
- Contrastive divergence is a generative method. That means that contrastive divergence can “find” the latent variables inherent in a training set of diverse patterns, and especially find those latent variables (features) that are useful in distinguishing certain patterns from each other – without being told in advance which “category” a pattern belongs to.
- Backpropagation is a discriminative method, and it works with situations where we know in advance which “category” a pattern belongs to. Backpropagation uses that knowledge to adapt weights so that each pattern winds up being associated with the “right category” for that pattern.
- When we use contrastive divergence to find a set of “latent variables” associated with an input pattern, those latent variables themselves become – in a sense – “categories.” These “categories” (of latent variables) are used, layer-by-layer, when refining the connection weights later using backpropagation. The categories become (typically) more and more abstract as we progress up the layers of a deep architecture.
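To make the pairwise (generative) training step concrete, here is a minimal sketch of contrastive divergence (CD-1) for a single visible/hidden layer pair, written in Python with NumPy. Everything here is illustrative: the names (n_visible, n_hidden, cd1_update), the layer sizes, the learning rate, and the toy data are choices made for this sketch, not taken from any particular paper or library.

```python
# Minimal CD-1 sketch for one RBM "layer pair": visible units v (the observed
# pattern) and hidden units h (the latent variables, or features).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, a, b, v0, lr=0.05):
    """One CD-1 step on a batch of binary visible vectors v0 (batch x n_visible)."""
    # Positive phase: infer hidden (latent) activations from the data.
    p_h0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

    # Negative phase: one Gibbs step (reconstruct the visibles, re-infer the hiddens).
    p_v1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b)

    # CD-1 update: data correlations minus reconstruction correlations.
    batch = v0.shape[0]
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / batch
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (p_h0 - p_h1).mean(axis=0)
    return W, a, b

# Toy usage: 6 visible units, 3 latent units, random binary "patterns".
n_visible, n_hidden = 6, 3
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
a = np.zeros(n_visible)   # visible biases
b = np.zeros(n_hidden)    # hidden biases
data = (rng.random((32, n_visible)) < 0.5).astype(float)

for epoch in range(100):
    W, a, b = cd1_update(W, a, b, data)
```

Notice what is absent: no category labels appear anywhere. The update compares correlations measured on the data against correlations measured on a one-step reconstruction, which is exactly what makes this a generative (not discriminative) method.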
This is, of course, hugely simplified. However, these points capture the gist of the two learning methods, both of which are used in deep architectures.
YouTubes for Learning the Backpropagation Method
One easy place to start is my series of YouTube vids: the “Neural Network: Backpropagation (The Series)” playlist, which I published in 2020 on my personal (Alianna Maren) YouTube channel:
https://www.youtube.com/playlist?list=PLUf2R_am1DRK9FYP6tfSEUV23qKEv4I3S
The first vid in that series is HERE:
Deep Learning Architectures are Based on Contrastive Divergence Learning, or Boltzmann Machines
It’s important to understand that there is a BIG difference between how the backpropagation algorithm works on a Multilayer Perceptron architecture and how the (restricted) Boltzmann machine is trained, where the actual training algorithm is contrastive divergence. These are EXTREMELY DIFFERENT beasts, even if they look somewhat the same, or can be forced to look somewhat the same.
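To make that contrast concrete, here is a bare-bones comparison of the two weight-update rules in standard notation (learning rate $\eta$, supervised error $E$, visible and hidden unit states $v_i$ and $h_j$); this is a sketch of the conventional formulas, not of any one implementation:

```latex
% Backpropagation (discriminative): gradient descent on a supervised error E,
% which requires a known target ("category") for each training pattern.
\Delta w_{ij} = -\,\eta \, \frac{\partial E}{\partial w_{ij}}

% Contrastive divergence, CD-1 (generative): difference of measured correlations,
% which requires only the training patterns themselves.
\Delta w_{ij} = \eta \left( \langle v_i h_j \rangle_{\text{data}}
                          - \langle v_i h_j \rangle_{\text{reconstruction}} \right)
```

The first rule cannot even be written down without a target to define the error $E$; the second never references a target at all.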
Many people (perhaps most), coming into the AI area for the first time, do not understand how the Boltzmann machine (and contrastive divergence) work. That’s because the Boltzmann machine – and its derivative, the restricted Boltzmann machine, or RBM – both involve energy-based physics, and that is a course of study unto itself.
Properly done, we would first spend time studying and internalizing statistical physics (this is what I mean by energy-based physics). The goal would be to get at least as far as the Ising equation (a basic stat phys model).
Then we would follow up with a study of Bayesian logic, and then (and ONLY then) see how the (restricted) Boltzmann machine (RBM) is created by inserting Bayesian logic into a stat phys (Ising) model, which is as weird as weird can be.
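For orientation only (a bare sketch of the standard equations, not a substitute for the stat phys study described above), the chain from the Ising model to the RBM looks like this:

```latex
% Ising model energy: spins s_i = +/-1, couplings J_{ij}, external fields h_i.
E(\mathbf{s}) = -\sum_{\langle i,j \rangle} J_{ij}\, s_i s_j - \sum_i h_i s_i

% RBM energy: binary visible units v_i, hidden units h_j, weights W_{ij}, biases a_i, b_j.
E(\mathbf{v}, \mathbf{h}) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j

% Boltzmann (Gibbs) distribution over joint states, with partition function Z.
p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z}\, e^{-E(\mathbf{v}, \mathbf{h})}

% The "Bayesian logic" step: the conditional probability of a hidden unit, given the data.
p(h_j = 1 \mid \mathbf{v}) = \sigma\!\Big(b_j + \sum_i v_i W_{ij}\Big),
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```

The “restricted” part is that there are no visible-to-visible or hidden-to-hidden connections, which is what lets the conditional probability factor neatly, unit by unit.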
The thing is – unless you understand what is really being done there, your knowledge will be superficial and sketchy.
It’s ok, when we start out, to be superficial and sketchy, so long as we each know that’s what’s going on.
The smart thing is to treat “superficial and sketchy” as a look at a map of the energy-based AI terrain: a quick overview of what you’ll be dealing with when you actually get into that study, which you’d likely take up not just after you’re done with this course, but after you’re done with this program.
Now, all that said … the two methods, backpropagation plus contrastive divergence (meaning the restricted Boltzmann machine architecture and its learning method), WORK TOGETHER in deep learning architectures.
It is very important to understand and internalize this.
If you have the motivation and energy, START HERE, and then look at the rest of the vids in the Themesis, Inc. YouTube playlist:
You can get an understanding for the role of statistical physics in energy-based neural networks (such as Boltzmann machines), even if you don’t (yet) understand the restricted Boltzmann machine, or how contrastive divergence works.
It is an extreme error to go forward thinking that deep learning architectures are just stacks of backpropagation-trained layers. That would be a very bad misconception, and would be horribly embarrassing going forward.
How Backpropagation Works with Contrastive Divergence (RBMs) in Deep Architectures
RBMs and Backpropagation and Deep Learning – YouTubes
Have a look at Hinton’s discussion HERE, where he talks about Using Backpropagation for Fine-Tuning a Generative Model. Pay attention starting around 8 min 30 sec and going forward.
It is ok to absorb this superficially for right now, just so that you’re aware that this is something you’ll WANT to come back to and absorb more fully, down the road.
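As a rough picture of what that fine-tuning looks like, here is a hedged sketch in Python with NumPy: weights found by layer-wise contrastive divergence are used to initialize a feedforward (“unrolled”) network, and ordinary backpropagation then adjusts them discriminatively using labeled data. In this sketch, W1 and W2 merely stand in for CD-pretrained weights (they are random placeholders here), and every name and size is an illustrative choice, not Hinton’s code.

```python
# Hedged sketch: discriminative fine-tuning of a (pretend) CD-pretrained stack.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Pretend W1, W2 came from two stacked, CD-pretrained RBMs (visible -> h1 -> h2).
n_in, n_h1, n_h2, n_classes = 6, 4, 3, 2
W1 = 0.1 * rng.standard_normal((n_in, n_h1))
W2 = 0.1 * rng.standard_normal((n_h1, n_h2))
W_out = 0.1 * rng.standard_normal((n_h2, n_classes))  # new output layer, randomly initialized

# Toy labeled data: this is where the discriminative information enters.
X = (rng.random((32, n_in)) < 0.5).astype(float)
y = rng.integers(0, n_classes, size=32)
Y = np.eye(n_classes)[y]  # one-hot targets

lr = 0.1
for step in range(200):
    # Forward pass through the unrolled stack.
    h1 = sigmoid(X @ W1)
    h2 = sigmoid(h1 @ W2)
    p = softmax(h2 @ W_out)

    # Backward pass: standard backpropagation of cross-entropy error derivatives.
    d_out = (p - Y) / X.shape[0]
    d_h2 = (d_out @ W_out.T) * h2 * (1.0 - h2)
    d_h1 = (d_h2 @ W2.T) * h1 * (1.0 - h1)

    W_out -= lr * h2.T @ d_out
    W2 -= lr * h1.T @ d_h2
    W1 -= lr * X.T @ d_h1
```

The design point to notice: the generative pretraining only supplies the starting weights; the category labels enter the picture solely through the backpropagated error.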
RBMs and Backpropagation – Key Papers
RBMs and Backpropagation and Deep Learning – Papers
After you’ve watched (or speed-skimmed) that hour-long tutorial, have a look at THIS crucial paper.
Salakhutdinov, R. and G. Hinton (2012). “An Efficient Learning Procedure for Deep Boltzmann Machines,” Neural Computation, 24 no. 8 (August): 1967-2006. Accessed Jan. 5, 2022. https://www.cs.cmu.edu/~rsalakhu/papers/neco_DBM.pdf
Look at the following sentence at the bottom of p. 1994 (that’s the journal pagination, not the article’s own page count!), in Sect. 4.2. It reads:
Standard backpropagation of error derivatives can then be used to discriminatively fine-tune the model.
Salakhutdinov and Hinton, 2012
The same sentiment is expressed in an earlier article by the same authors.
Salakhutdinov, R. and G. Hinton (2009). “Deep Boltzmann Machines.” Journal of Machine Learning Research 5 no. 2: 448-455. Accessed Jan. 5, 2022. https://proceedings.mlr.press/v5/salakhutdinov09a/salakhutdinov09a.pdf
After the subsequent discriminative fine-tuning, the “unrolled” DBM achieves a misclassification error rate of 10.8% on the full test set. … After learning a good generative model, the discriminative fine-tuning (using only the 24300 labeled training examples without any translation) reduces the misclassification error down to 7.2%. (Section 4.2, p. 455)
Salakhutdinov and Hinton, 2009
References
- Salakhutdinov, R. and G. Hinton (2012). “An Efficient Learning Procedure for Deep Boltzmann Machines,” Neural Computation, 24 no. 8 (August): 1967-2006. Accessed Jan. 5, 2022. https://www.cs.cmu.edu/~rsalakhu/papers/neco_DBM.pdf
- Salakhutdinov, R. and G. Hinton (2009). “Deep Boltzmann Machines.” Journal of Machine Learning Research 5 no. 2: 448-455. Accessed Jan. 5, 2022. https://proceedings.mlr.press/v5/salakhutdinov09a/salakhutdinov09a.pdf
- Maren, Alianna J. (2018). “Generative vs. Discriminative – Where It All Began.” Accessed Jan. 5, 2022. https://www.aliannajmaren.com/2018/07/10/generative-vs-discriminative-where-it-all-began/
Reference for Chicago Style (Author-Date) Formatting
A good reference for your Chicago-style (Author-Date) formatting, used to format the above references, is:
https://www.chicagomanualofstyle.org/tools_citationguide/citation-guide-2.html