Latent Variables in Neural Networks and Machine Learning

Latent variables are one of the most important concepts in energy-based neural networks (the restricted Boltzmann machine and everything that descends from it), in key natural language processing (NLP) algorithms such as latent Dirichlet allocation (LDA) and all forms of transformers, and in machine learning methods such as variational inference. The notion of finding an appropriate set of latent variables is central to variational autoencoders.

The challenge of finding appropriate latent variables becomes more pronounced as we seek improved methods by which systems (particularly variational autoencoders) can do self-supervised learning, meaning that their dependence on a labeled training data set is greatly reduced while they extract the essence of a large collection of unlabeled data. This has become a dominant challenge in machine learning.

This post was inspired by two recent arXiv papers on latent variables, and also by Yann LeCun’s 2020 ICRA Plenary Lecture; all three of these addressed the topics of latent variables and self-supervised learning:

  • A new approach to the problem of latent variable non-identifiability, developed by Wang, Blei, and Cunningham (2023),
  • A comprehensive review of self-supervised learning (Gui et al., 2023), and (of course)
  • The 2020 ICRA Plenary Lecture by Yann LeCun, leading to a focus on self-supervised learning.

The Starting Point: Yann LeCun on Self-Supervised Learning

The most useful way to orient ourselves for this micro-study in latent variables and self-supervised learning is the ICRA 2020 talk “Self-Supervised Learning and World Models,” given by Yann LeCun.

We begin our study with a summary of LeCun’s major points.

LeCun, Yann. 2020. “Self-Supervised Learning and World Models.” ICRA 2020 Plenary Talk, presented on the IEEE Robotics and Automation Society YouTube Channel (Original presentation on June 2, 2020; YouTube upload in 2021). (Accessed July 13, 2023, available online at https://www.youtube.com/watch?v=eZo1zEepWc0.)

LeCun clearly identified the “three problems [that] the community must solve” as:

  • Learning with fewer labeled samples and/or fewer trials,
  • Learning to reason, and
  • Learning to plan complex action sequences.

LeCun begins with a quick review of two well-known methods: deep learning (DL) with convolutional neural networks (CNNs) for image analysis (as well as many related tasks), and reinforcement learning (RL). This covers the first 10-11 minutes. He then transitions into what we know about human/biological learning.

At twelve (12) minutes into his talk, LeCun gets to the heart of the matter, which he sums up as: “The revolution will not be supervised (nor purely reinforced).” (Catchy slogan, hmm?) (And as a cute note: the Jitendra Malik quote that he puts at the bottom of the slide is: “Labels are the opium of the machine learning researcher.”)

All of this leads to the kick-off of his primary theme of “self-supervised learning” (beginning at 12:09). He sums the uses of self-supervised learning (SSL) as:

  • Learning hierarchical representations of the world, and
  • Learning predictive (forward) models of the world.

The question is then (paraphrased): “How to represent uncertainty/multimodality in the prediction?”

LeCun uses this to introduce inference and multimodal predictions through constraint relaxation.

He then moves into the meat of the matter – energy-based models, which he notes are used for inference (not for learning). And from here, he introduces the notion of latent variables as a means to parameterize the set of predictions. He states that “Ideally, the latent variable represents independent explanatory factors of the prediction” (18:17). He also notes that “The information capacity of the latent variable must be minimized. (Otherwise, all the information for the prediction will go into it.)”

On the following slide, LeCun makes it very clear that he is dealing with latent-variable energy-based models (EBMs) for inference.
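The core idea can be sketched with a toy latent-variable EBM. This is my own minimal construction for illustration (the energy function `f`, the grid of latent values, and the numbers are all hypothetical, not from LeCun's talk): the latent variable z parameterizes the set of predictions, and inference means minimizing the energy over z.

```python
import numpy as np

# Toy latent-variable energy-based model (illustrative sketch only).
# Energy E(x, y, z) = (y - f(x, z))^2; the latent z indexes which of
# several plausible predictions the model makes for input x.

def f(x, z):
    # Hypothetical two-mode predictor: each latent value z selects
    # a different branch of the prediction.
    return x + z

def energy(x, y, z):
    return (y - f(x, z)) ** 2

def infer(x, y, z_grid):
    # Inference in an EBM: minimize the energy over the latent variable.
    energies = [energy(x, y, z) for z in z_grid]
    i = int(np.argmin(energies))
    return z_grid[i], energies[i]

x, z_grid = 0.5, [-1.0, 1.0]
for y in (-0.5, 1.5):  # two different observed outcomes for the same x
    z_star, e_star = infer(x, y, z_grid)
    print(y, z_star, e_star)  # each outcome is explained by a different z
```

Note how a single x is compatible with two different outcomes y, each reached at zero energy by a different latent value; this is the multimodality that z is meant to capture, and it also illustrates why the latent variable's information capacity must be limited: with enough latent values, every y could be "explained" trivially.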

Sidebar: Variational Inference (Resources)

By now, it’s clear that this entire conversation is pivoting in the direction of variational inference.

Actually, though, it is less about variational inference and more about variational autoencoders.

Still, variational inference is the best way to ease into variational autoencoders. This is because variational autoencoders combine notions from two ways in which energy-based methods are used in neural networks and machine learning:

  • Energy-based neural networks, specifically, the (Little-)Hopfield neural network, the Boltzmann machine, and all of its derivatives and descendants, and
  • Variational methods, which seek to find parameters that allow for the best match between a model and a data set.

The following Figure 1 shows how certain “essential concepts” are used across a wide range of neural networks and machine learning methods. We can see that latent variables, as well as Bayesian probabilities and the notion of “free energy” from statistical mechanics, are common to both energy-based methods (the Boltzmann machine and its derivatives) as well as variational methods.

However, even though the notion of latent variables is ubiquitous, this notion can (and must) be interpreted in different ways, depending on the algorithm used. Further, there are surprising and subtle challenges in correctly finding and using appropriate latent variables.

Figure 1. Fundamental components of major AI/ML (machine learning) methods. Note that variational inference (second column to the right) uses free energy in a “basic” sense, whereas the more specific Ising model is used across a suite of neural network methods, ranging from the Boltzmann machine up to various NLP methods. Image taken from the Themesis YouTube (2023), “The Matrix: Framework for a New Class of Neural Networks.” (See YouTube link below.)

Figure 1 is taken from the Themesis YouTube (see below), “The Matrix: Framework for a New Class of Neural Networks.” It presents an overview, developed as “the Matrix,” which correlates major concepts in AI/ML with specific algorithms/methods.

Maren, Alianna J. 2023. “The Matrix: Framework for a New Class of Neural Networks.” Themesis YouTube Channel (June 22, 2023). (Accessed July 13, 2023; available online at https://www.youtube.com/watch?v=gVuC8we6f1M&t=280s.)

The logical starting place for understanding variational inference is the Kullback-Leibler (KL) divergence. The KL divergence can be rewritten two different ways, one of which is in the form of a free energy equation, which is useful for variational inference.
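Concretely, with x the observed data, z the latent variable, and q(z) an approximating distribution, the rewriting into free-energy form is the standard identity (notation mine):

\[
\mathrm{KL}\big(q(z)\,\|\,p(z\mid x)\big)
= \mathbb{E}_{q}\big[\log q(z) - \log p(z\mid x)\big]
= \underbrace{\mathbb{E}_{q}\big[\log q(z) - \log p(x,z)\big]}_{F(q),\ \text{the variational free energy}} + \log p(x).
\]

Since the KL divergence is non-negative, minimizing F(q) simultaneously tightens the bound \(\log p(x) \ge -F(q)\) and drives q toward the true posterior; that is exactly the move that variational inference makes.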

AJM’s Note: All “energy-based methods,” whether they use the Ising equation as their substrate (as do the Boltzmann machine and its derivatives) or are variational methods, rest on the free energy equation from statistical mechanics. The primary difference is that for the energy-based neural networks, the parameters are the specific connection weights between the hidden and visible nodes in the neural network. This is what LeCun refers to when he says that the topic he is addressing is “inference (not for learning).” (See the second-to-last paragraph in the preceding section.) For variational methods, the same underlying free energy equation is used; the adjustable parameters have to do with matching the model to the data, not with specific connection weights in a neural network.
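The free-energy rewriting of the KL divergence can be checked numerically. The following is a small sketch with a made-up discrete model (the distributions are arbitrary numbers I chose for illustration), verifying that the evidence log p(x) splits exactly into the ELBO (negative free energy) plus the KL divergence between the variational distribution and the true posterior:

```python
import numpy as np

# Toy discrete model (illustrative numbers only): 3 latent states z,
# one observed data point x.
p_z = np.array([0.5, 0.3, 0.2])          # prior p(z)
p_x_given_z = np.array([0.9, 0.2, 0.5])  # likelihood p(x|z) of the observed x
p_xz = p_z * p_x_given_z                 # joint p(x, z)
p_x = p_xz.sum()                         # evidence p(x)
p_z_given_x = p_xz / p_x                 # exact posterior p(z|x)

q = np.array([0.6, 0.3, 0.1])            # arbitrary variational distribution

elbo = np.sum(q * (np.log(p_xz) - np.log(q)))       # ELBO = -F(q)
kl = np.sum(q * (np.log(q) - np.log(p_z_given_x)))  # KL(q || p(z|x))

# The identity: log p(x) = ELBO + KL, for ANY choice of q.
assert np.isclose(np.log(p_x), elbo + kl)
```

Because log p(x) is fixed by the data, raising the ELBO (equivalently, lowering the free energy) must shrink the KL term; adjusting q is the "matching the model to the data" move described in the note above.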

For someone who is new to the realm of variational methods, Themesis has a baker’s dozen (total of thirteen) blogposts that take one from the KL divergence through free energy to variational inference and active inference (a notion proposed by Karl Friston). The starting point is HERE. The final post in that series, which is a RESOURCE COMPENDIUM, organizes all the references and resources presented in the prior twelve blogposts, according to difficulty and topic. Full references (and again, the links) for these blogposts are in the Resources and References section, under Prior Related Blogposts, at the end of this post.


Back to LeCun’s Talk …

Time constraints necessitate a cut-off.

We’ll pick up again in a future post, wrapping up a few more points from LeCun, then taking a look at that excellent review article by Gui et al., and then investigating what Wang, Blei, and Cunningham pose as a reasonable solution for the “latent variable non-identifiability” problem, which is a core challenge for self-supervised learning.

Until then, for those who want to do a little further study, two Medium.com articles on variational inference vs. variational autoencoders may prove useful.

Both of these articles offer useful insights and perspectives, and are accessible at an easy-to-read level.

(Full references to both in the References section, at the end of this post.)


“Live free or die,”* my friend!

Alianna J. Maren, Ph.D.

Founder and Chief Scientist, Themesis, Inc.

  * “Live free or die. Death is not the worst of evils.” Attributed to U.S. Revolutionary War General John Stark.

Resources and References

Prior Related Blogposts

AJM’s Note: This is the STARTING POINT for the twelve-blog series on the KL divergence, free energy, and variational (and active) inference.

AJM’s Note: This is the RESOURCE COMPENDIUM, presented as a follow-up (making the series a baker’s dozen), for the Themesis blogpost series on the KL divergence, etc. This blogpost series serves as a full tutorial sequence for someone who wants to self-educate on these related topics.


References

AJM’s Note: This is an excellent and detailed review of SSL methods (over 250 references; well-organized; leads with reference to the LeCun talk summarized at the beginning of this post).

  • Gui, Jie, Tuo Chen, Qiong Cao, Zhenan Sun, Hao Luo, and Dacheng Tao. 2023. “A Survey of Self-Supervised Learning from Multiple Perspectives: Algorithms, Theory, Applications and Future Trends.” arXiv:2301.05712v1 [cs.LG] (13 Jan 2023). (Accessed July 9, 2023; available online at https://arxiv.org/pdf/2301.05712.pdf.)

AJM’s Note: David Blei, who (with Andrew Ng and Michael Jordan) was one of the latent Dirichlet allocation (LDA) inventors, and who was also senior author on an important variational inference review (see prior blogpost for details), has continued his interest in latent variables. This paper introduces a new method to deal with the problem of “latent variable non-identifiability,” which is central to the issue of working with variational inference and self-supervised learning.

The lead author for this work is Yixin Wang, who did her doctoral work with David Blei, and is now Assistant Professor of Statistics at the University of Michigan, as well as a faculty affiliate of the Michigan Institute for Data Science (MIDAS).

  • Wang, Yixin, David M. Blei, and John P. Cunningham. 2023. “Posterior Collapse and Latent Variable Non-identifiability.” arXiv:2301.00537v1 [stat.ML] (2 Jan 2023). (Accessed July 9, 2023; available online at https://arxiv.org/abs/2301.00537.)

AJM’s Note: The above citation is for the expanded version of the original paper, and contains additional explanatory comments, proofs, and experimental results, as well as additional references. The original NeurIPS paper was published in 2021.

AJM’s Note: The following two articles, doing a contrast-and-compare between variational inference and variational autoencoders, are a good preparation for moving deeper into the subject.
