In this blogpost, we investigate how transformer-based architectures actually implement the three key elements of generative AI:
- The REVERSE Kullback-Leibler divergence,
- Bayesian conditional probabilities, and
- Statistical mechanics (minimizing a free energy equation).
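For orientation, here is a minimal sketch (in our own notation, not drawn verbatim from any of the works cited below) of how these three elements fit together:

```latex
% Reverse KL divergence: the approximating distribution Q sits in the first slot.
D_{\mathrm{KL}}\left( Q \,\|\, P \right) = \sum_{x} Q(x)\,\ln\frac{Q(x)}{P(x)}

% With a Bayesian posterior P(x \mid y) = P(y \mid x)\,P(x) / P(y),
% minimizing the reverse KL between Q(x) and P(x \mid y) is the same as
% minimizing a variational free energy F(Q), because \ln P(y) is a constant:
F(Q) = \underbrace{\mathbb{E}_{Q}\left[ -\ln P(y, x) \right]}_{\text{energy}}
     - \underbrace{\mathbb{H}\left[ Q \right]}_{\text{entropy}}
     = D_{\mathrm{KL}}\left( Q(x) \,\|\, P(x \mid y) \right) - \ln P(y)
```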
The Easiest Entry Point
I love Matthias Bal’s blogposts that put transformers into a statistical mechanics context – here’s one of my favorites, and possibly the EASIEST to read.
And truly, you don’t have to read the whole thing – a light skim will give you the “big ideas.” This is the blogpost on “An Energy-Based Perspective on Attention Mechanisms in Transformers.” (See full citation for this and other works in the References and Resources section, under Blogposts by Others.)
What is particularly useful in this tutorial is the connection that Bal makes between transformers and the Hopfield neural network, which we discussed in the previous blogpost, “Day 2 – Autoencoders.”
Transformers and the Kullback-Leibler Divergence
We have to dig a bit to find how transformers use the (reverse) Kullback-Leibler divergence. Here are a few articles and/or blogposts that identify this connection.
Example 1: Matthias Bal, “Spin-Model Transformers”
Here’s a section from one of Matthias Bal’s blogposts where he introduces this connection:
“The factorized model Q(s_t | θ_t*) that minimizes the Kullback-Leibler (KL) divergence … (see the figure below for a version of the reverse KL divergence) … has mean magnetizations m_t identical to those of the target distribution P(s_t) since, for all spins i = 1, 2, …, N, we find that … (and more equations follow).”
And note that even when an author simply uses the phrase “Kullback-Leibler divergence,” in the context of generative AI they always MEAN the “REVERSE Kullback-Leibler divergence.”
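To make that distinction concrete, here is a small numerical sketch (our own toy example, not taken from any of the cited posts) showing that the forward and reverse KL divergences between the same two distributions are generally different numbers:

```python
import numpy as np

def kl_divergence(a, b):
    """D_KL(a || b) for two discrete distributions given as arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(a * np.log(a / b)))

p = np.array([0.7, 0.2, 0.1])   # "target" distribution P
q = np.array([0.4, 0.4, 0.2])   # "model" distribution Q

print("forward KL, D_KL(P || Q):", kl_divergence(p, q))
print("reverse KL, D_KL(Q || P):", kl_divergence(q, p))
# The two values differ: the KL divergence is not symmetric, which is why
# it matters whether an author means the forward or the reverse form.
```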
Here’s another Matthias Bal quote, this time from “An Energy-Based Perspective …”:
“For any query ξ ∈ R^d, or state pattern, we want to find a way to retrieve the closest stored pattern.”
Although Bal does not explicitly mention the (reverse) KL divergence here, it shows up when we look at the Hopfield neural network (see Example 3 below).
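For a feel of what that retrieval looks like in the modern (continuous) Hopfield setting that Bal’s post builds on, here is a rough sketch in code (our illustration, not Bal’s): stored patterns are the rows of a matrix X, and the query ξ is repeatedly replaced by a softmax-weighted combination of the stored patterns, which has the same algebraic shape as transformer attention.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def hopfield_retrieve(X, xi, beta=4.0, n_steps=3):
    """Pull the query xi toward the closest stored pattern (a row of X).

    Each step computes attention-like weights softmax(beta * X @ xi) and
    replaces xi with the corresponding weighted sum of stored patterns.
    """
    for _ in range(n_steps):
        weights = softmax(beta * X @ xi)   # similarity of xi to each stored pattern
        xi = X.T @ weights                 # weighted combination of stored patterns
    return xi

# Toy example: three stored patterns; the query starts near the first one.
X = np.array([[1.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0]])
xi = np.array([0.9, 0.1, 0.0, 0.8])
print(hopfield_retrieve(X, xi))   # ends up close to the first stored pattern
```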
Example 2: Jay Alammar, “The Illustrated Transformer”
Jay Alammar has probably the best and most readable tutorials on transformers. He only mentions the KL divergence briefly – towards the very end of this blogpost – but he does work it into his description:
“How do you compare two probability distributions? We simply subtract one from the other. For more details, look at cross-entropy and Kullback–Leibler divergence…”
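In Alammar’s setting, the two distributions being compared are the model’s output distribution over the vocabulary and the desired (one-hot) distribution for the correct next token. Here is a minimal sketch of that comparison, using a tiny made-up vocabulary:

```python
import numpy as np

vocab = ["i", "am", "a", "student", "<eos>"]

# Model's predicted distribution over the vocabulary for one output position.
predicted = np.array([0.05, 0.05, 0.10, 0.70, 0.10])

# Desired (one-hot) distribution: the correct next token is "student".
target = np.array([0.0, 0.0, 0.0, 1.0, 0.0])

# Cross-entropy H(target, predicted); with a one-hot target this reduces
# to -log of the probability the model assigned to the correct token.
cross_entropy = -np.sum(target * np.log(predicted + 1e-12))

# KL divergence D_KL(target || predicted) is the cross-entropy minus the
# entropy of the target, and a one-hot target has zero entropy.
kl = cross_entropy - 0.0

print(f"cross-entropy: {cross_entropy:.4f}   KL divergence: {kl:.4f}")
```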
Example 3: The KL Divergence and the Hopfield Neural Network (and Connection to Transformers)
This example is best read after Matthias Bal’s blog on “An Energy-Based Perspective.” For a very well-illustrated look at how the KL divergence is an alternative way to express the learning process in a Hopfield neural network, check out Downing’s slidedeck on “Neural Networks on Energy Landscapes.”
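To tie these pieces together, here is a small sketch (ours, not Downing’s or Bal’s) of a classical binary Hopfield network: Hebbian storage of patterns, the energy function, and asynchronous updates that move a corrupted state downhill on the energy landscape toward a stored pattern.

```python
import numpy as np

def store_patterns(patterns):
    """Hebbian weights W = (1/N) * sum_k xi_k xi_k^T, with zero diagonal."""
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)
    return W

def energy(W, s):
    """Hopfield energy E(s) = -1/2 * s^T W s (no bias terms)."""
    return -0.5 * s @ W @ s

def recall(W, s, n_sweeps=5):
    """Asynchronous updates; each accepted flip never increases the energy."""
    s = s.copy()
    for _ in range(n_sweeps):
        for i in range(len(s)):
            h = W[i] @ s
            if h != 0:
                s[i] = 1.0 if h > 0 else -1.0
    return s

patterns = np.array([[1, -1, 1, -1, 1, -1],
                     [1, 1, -1, -1, 1, 1]], dtype=float)
W = store_patterns(patterns)
noisy = np.array([1, -1, 1, -1, -1, -1], dtype=float)    # corrupted copy of pattern 0
print(energy(W, noisy), energy(W, recall(W, noisy)))     # energy decreases
print(recall(W, noisy))                                  # recovers pattern 0
```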
References and Resources
Prior Relevant Blogposts by AJM
- Maren, Alianna J. 2024. “Twelve Days of Generative AI: Day 2 – Autoencoders.” Themesis, Inc. Blogpost Series (Dec. 7, 2024). (Accessed Dec. 20, 2024; available at “Autoencoders.”)
- Maren, Alianna J. 2023. “Evolution of NLP Algorithms through Latent Variables: Future of AI (Part 3 of 3).” Themesis, Inc. Blogpost Series (May 30, 2023). (Accessed Dec. 20, 2024; available at “NLP.”)
Prior Relevant Blogposts by OTHERS
AJM’s Note: Matthias Bal’s Tutorials on Transformers and Statistical Mechanics. I love this one because it is ALMOST readable for those who are not particularly inclined towards stat-mech; all the key concepts, and well-expressed.
- Bal, Matthias. 2020. “An Energy-Based Perspective on Attention Mechanisms in Transformers.” Matthias Bal’s GitHub Blogpost Series (Dec. 3, 2020). (Accessed Dec. 20, 2024; available online at “An Energy-Based Perspective …”.)
AJM’s Note: I found this pair of blogposts by Matthias Bal on transformers when doing a prior blogpost series on the KL divergence, free energy, and variational Bayes. This pair of posts also supports our study of transformers for NLP methods, such as BERT and the GPT series.
- Bal, Matthias. 2020. “Transformer Attention as an Implicit Mixture of Effective Energy-Based Models.” Matthias Bal’s GitHub Blogpost Series (Dec. 30, 2020). (Accessed Oct. 3, 2022; https://mcbal.github.io/post/transformer-attention-as-an-implicit-mixture-of-effective-energy-based-models/.)
- Bal, Matthias. 2021. “Transformers Are Secretly Collectives of Spin Systems: A Statistical Mechanics Perspective on Transformers.” Matthias Bal’s GitHub Blogpost Series (Nov. 29, 2021). (Accessed Oct. 3, 2022; https://mcbal.github.io/post/transformers-are-secretly-collectives-of-spin-systems/.)
AJM’s Note: Jay Alammar’s Transformer tutorials – Alammar has great tutorials on transformers. Here’s one where he mentions the connection with the KL divergence.
- Alammar, Jay. 2018. “The Illustrated Transformer.” Jay Alammar’s GitHub.io Blogpost Series (June 27, 2018). (Accessed Dec. 20, 2024; available online at Jay Alammar’s Illustrated Transformer Tutorial.)
AJM’s Note: Sambit Barik’s NLP tutorials – very good discussion / presentation of the KL divergence, in a broader NLP context – relevant also to knowledge distillation (which is something that LLMs do).
- Barik, Sambit Kumar. 2024. “Understanding KL Divergence for NLP Fundamentals: A Comprehensive Guide with PyTorch Implementation.” Medium.com (Sept. 14, 2024) (Accessed Dec. 20, 2024; available online at “Understanding …“.)
AJM’s Note: Transformer tutorials by Stefania Cristina – the original Vaswani et al. paper is not that hard to read, but sometimes, a more detailed explanation is useful. These two, by Stefania Cristina, first show how transformers are used in text vectorization, and then dive into transformer architectures.
- Cristina, Stefania. 2022. “The Transformer Attention Mechanism.” Blogpost on MachineLearningMastery.com. (Sept. 15, 2022.) (Accessed May 2, 2023; available online at https://machinelearningmastery.com/the-transformer-attention-mechanism/.)
- Cristina, Stefania. 2023. “The Transformer Model.” Blogpost on MachineLearningMastery.com. (January 6, 2023.) (Accessed May 2, 2023; available online at The Transformer Model.)
- Dr. Cristina has a number of other transformer-related blogposts, including some that take you through model development; a link to her full set of MachineLearningMastery.com posts is HERE.
For Future Reference – Reinforcement Learning and LLMs and the KL Divergence
When we get into RLHF (reinforcement learning from human feedback), this is a good blogpost on how the (reverse) KL divergence is used when combining RLHF with LLMs; the sketch after the reference below gives the flavor.
- Bhardwaj, Rakshit. 2024. “KL – Deviance in LLM – Reinforcement Learning from Human Feedback.” Rakshit Bhardwaj’s LinkedIn Post Series (Feb. 10, 2024). (Accessed Dec. 20, 2024; available online at “KL-Deviance in LLM.”)
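As a preview of how that works: in RLHF-style fine-tuning, the reward used to update the policy (the LLM being tuned) typically includes a KL penalty that keeps the policy close to a frozen reference model. The sketch below is ours and uses hypothetical per-token log-probabilities rather than a real model:

```python
import numpy as np

# Hypothetical per-token log-probabilities for one generated response, from
# the model being fine-tuned (the policy) and from the frozen reference model.
logp_policy    = np.array([-1.2, -0.4, -2.1, -0.8])
logp_reference = np.array([-1.5, -0.9, -1.8, -1.1])

reward_from_preference_model = 2.3   # scalar score for the whole response
beta = 0.1                           # strength of the KL penalty

# Per-token estimate of the (reverse) KL between policy and reference,
# evaluated on tokens that were sampled from the policy itself.
per_token_kl = logp_policy - logp_reference

# Penalized reward used in the RL step: a high preference score is good,
# but drifting too far from the reference model is discouraged.
shaped_reward = reward_from_preference_model - beta * per_token_kl.sum()
print(shaped_reward)
```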
Relevant Papers
AJM’s Note: This is the key article that kicked off the transformer-based evolution of NLP methods. How BERT and the GPT series use transformers is entirely different, but this is the crucial predecessor for both.
- Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” arXiv:1706.03762 [cs.CL]. (Accessed May 2, 2023; available online at arXiv:1706.03762 [cs.CL].)