Evolution of NLP Algorithms through Latent Variables: Future of AI (Part 3 of 3)

AJM Note: ALTHOUGH NEARLY COMPLETE, references and discussion still being added to this blogpost. This note will be removed once the blogpost is completed – anticipated over the Memorial Day weekend, 2023. Note updated 12:30 AM, Hawai’i Time, Tuesday, May 29, 2023.

This blogpost accompanies a YouTube vid on the same topic of the Evolution of NLP (natural language processing) Algorithms.

Maren, Alianna J. 2023. “Evolution of AGI and NLP Algorithms Using Latent Variables: Future of AI (Part 3 of 3).” Themesis YouTube Channel (May 31, 2023). (Accessed June 1, 2023; available online at https://www.youtube.com/watch?v=s3hsebWsWHk.)

The entire class of natural language processing (NLP) algorithms – leading to current systems such as ChatGPT, GPT-X (whatever the current iteration is), and other large language models (LLMs) – consists of logical descendants of Teuvo Kohonen’s earlier vector-mapping algorithms: the Learning Vector Quantization (LVQ) method and the Self-Organizing Topology-Preserving Map (SOTPM, or sometimes, simply SOM).

Figure 1. The original logical topology of neural networks (Maren, 1991), emphasizing class (c) – which at that time was based on Teuvo Kohonen’s LVQ and SOTPM neural networks.

And – most importantly – what made these breakthroughs possible, allowing LLMs to do their jobs, is the same kind of breakthrough that enabled other neural networks, and subsequently deep neural architectures, to work.

The key breakthrough here, as with the class (a) MLPs and the class (b) Boltzmann machine-type networks (laterally-connected, free energy-minimizing networks), is the incorporation of latent variables.

Figure 2. The previous “logical topology of neural networks class (c)” networks – the topologically-organized vectors – is transformed into a new class of algorithms, typically applied to natural language processing (NLP), where the key defining feature is the use of latent variables.

Latent variables: they’re the defining characteristic of powerful and effective neural networks.

We will see, as we go on, that latent variables are an overarching theme within artificial intelligence and machine learning methods, and are not confined to neural networks only.

Also, when we introduce latent variables into a system, we need a means of learning the values that these latent variables take on. This is typically done through a neural network training regime, which obtains the connection-weight values between the visible and the latent nodes. This training regime has, so far, been one of two alternatives (a minimal training sketch follows the list below):

  • Stochastic gradient descent, a class of algorithms that includes backpropagation, and
  • Free energy minimization, which is used for all energy-based neural networks.
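
The following is a minimal sketch of the first alternative, assuming PyTorch and a toy XOR task (both are illustrative assumptions, not drawn from any paper discussed here). The hidden layer supplies the latent variables, and stochastic gradient descent (here, plain backpropagation) learns the connection weights between the visible and latent nodes.

```python
# Minimal sketch: latent (hidden) units trained by stochastic gradient descent.
# Assumptions: PyTorch is installed; the toy XOR data set is purely illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

model = nn.Sequential(
    nn.Linear(2, 4),   # visible-to-latent connection weights
    nn.Tanh(),         # nonlinear latent (hidden) units
    nn.Linear(4, 1),   # latent-to-output connection weights
    nn.Sigmoid(),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
loss_fn = nn.MSELoss()

for epoch in range(5000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()    # backpropagation: one member of the SGD family
    optimizer.step()

print(model(X).detach().round())  # should approximate XOR: [[0],[1],[1],[0]]
```

The free energy-minimizing alternative, such as contrastive divergence training of a (restricted) Boltzmann machine, follows the same pattern of latent units plus a weight-learning rule, but with an energy-based objective in place of the gradient-descent loss.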

In the preceding post, we focused on two of the logical topology classes:

  • Class (a): Multilayer Perceptrons, trained using some form of stochastic gradient descent, and
  • Class (b): Energy-based neural networks, including the (Little-) Hopfield neural network, the Boltzmann machine, the restricted Boltzmann machine, and all their derivatives and descendants

In the rest of this post, we briefly give attention to three of the four remaining logical topology classes: classes (d), (e), and (f).

Then, we give our strongest attention to the set of natural language processing (NLP) algorithms – earlier appearing as “class (c)” in the original logical topology. We identify three key breakthroughs that have enabled evolution of the powerful NLP algorithms that we know today.

We also identify the limitations of these algorithms, based on our understanding of their underlying equations.

This means that we can then assess the “intelligence” that can emerge (or not) from the “evolved” class (c) algorithms.

Since, as a bottom line, these algorithms do not support a great deal of “intelligence,” we conclude by looking at a different realm of artificial intelligence / machine learning; one that can indeed potentially support artificial general intelligence, or AGI.

Finally, we note where an entirely new neural networks class can emerge; one distinct from the prior six classes identified in the (now) 35-year-old logical topology.


Context: Framework for a New Neural Networks Class

In this post, and in the accompanying YouTube, we return to our re-drawn logical topology (Maren, 1991) as a guide for investigating each of six different neural networks classes.

Figure 3. The context for this overall work is introducing a new neural networks class. To create a framework, in this blogpost and in the associated YouTube vid, we examine the class of language models.

But First – Two Prior Neural Networks Classes

In the previous blogpost and accompanying YouTube (see below), we investigated class (a) – the class of Multilayer Perceptrons (trained with stochastic gradient descent) and class (b) – the class of energy-based neural networks.

Figure 4. The PREVIOUS blogpost and associated YouTube focused on the first two classes within the logical topology: class (a) – Multilayer Perceptrons, and class (b) – energy-based neural networks. Full Chicago-style references are given at the end of this blogpost.

This YouTube, published on May 18, 2023, gives a concise contrast-and-compare between the Multilayer Perceptron class (a) and the class (b) energy-based neural networks, typified by the (restricted) Boltzmann machine.

Maren, Alianna J. 2023. “A New Neural Network Class: Creating the Framework.” Themesis, Inc. YouTube Channel (May 18, 2023). (Accessed May 22, 2023; available at https://www.youtube.com/watch?v=KHuUb627POs.)

Class (c): Topologically-Organized Sets of Vectors

This is the neural networks class that led to the suite of NLP algorithms, culminating in the current surge of large language models (LLMs).

In the beginning, though, this neural networks class (c) of Topologically- (originally Topographically-) Organized Sets of Vectors was agnostic about the kind of vectors that were used.

It has only been since the early 2000’s that the methods of this class became tuned towards natural language processing, or NLP.

It’s a little bit of a judgment call as to whether or not you’d really call the original members of this class true “neural networks.” Essentially, they were vectors – not “neural networks.” They were organized according to a cosine similarity training regime.

Figure 5. Class (c) in the Logical Topology of Neural Networks contains those networks that perform topographic organization of vector sets.

These neural networks were originally developed by Teuvo Kohonen (1995, 1997), and are characterized by two methods:

  • Learning Vector Quantization (LVQ) networks – essentially, a “neural” implementation of the k-means algorithm, and
  • Self-Organizing Topology-Preserving Maps (SOTPMs, or sometimes, SOMs).

In a sense, these are not really neural networks at all. They are a set of methods for dealing with vectors, and for arranging those vectors in a certain “space” – usually a 2-D space, but not limited to that.

These methods used a very simple approach – cosine similarity – for moving vectors around in their “mapping space” so that the most similar vectors would wind up closest to each other.
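
Here is a minimal numpy sketch of that cosine-similarity organizing step, under illustrative assumptions (a small 2-D grid of reference vectors, a single 3-D input, and a fixed learning rate; a full SOM would also update the neighbors of the winning unit):

```python
# Minimal sketch of a cosine-similarity "best matching unit" update.
# The grid size, input vector, and learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
grid = rng.normal(size=(5, 5, 3))            # 5x5 map of 3-D reference vectors

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([0.9, 0.1, 0.2])                # one input vector

# Find the best-matching unit (the map vector most similar to the input) ...
sims = np.array([[cosine(grid[i, j], x) for j in range(5)] for i in range(5)])
bmu = np.unravel_index(np.argmax(sims), sims.shape)

# ... and nudge it (and, in a full SOM, its neighbors) toward the input,
# so that similar inputs wind up close together on the map.
lr = 0.1
grid[bmu] += lr * (x - grid[bmu])
```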

As with many other neural networks, these methods became less popular during the decade-long “neural networks winter” starting in the early 2000’s.

The important thing to note is: neither of these methods involved any latent variables.

The breakthroughs started to happen in the early 2000’s, and centered on natural language processing (NLP) applications.

The Breakthroughs Enabling Current NLP Algorithms

We return now to our class (c) neural networks, the “topologically-organized” (earlier, “topographically-organized”) neural networks – which were those devised by Kohonen in the 1980’s.

We recall that the learning vector quantization, or LVQ, network was a “neural” implementation of the k-means clustering algorithm. This algorithm was limited. Essentially, it didn’t have latent variables. (A minimal k-means sketch follows below.)
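
To make that limitation concrete, here is a minimal scikit-learn sketch (the toy data points are an illustrative assumption): k-means assigns each item to exactly one cluster, with no latent structure shared across items.

```python
# Minimal sketch of k-means' hard, one-cluster-per-item assignment (toy 2-D data assumed).
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[0.1, 0.2], [0.0, 0.1], [0.9, 1.0], [1.0, 0.9]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)  # each point receives exactly one cluster label, e.g. [0 0 1 1]
```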

There were two key breakthroughs and what we call an “umbrella paper” (a substantial evolutionary step), each of which involved latent variables. Each was significant and different from the others. Together, they have enabled the entire new realm of NLP architectures; the entire suite of large language models (LLMs).

These NLP breakthroughs were:

  • Breakthrough #1 – Latent Dirichlet Allocation (LDA): Finding latent variables in text corpora that allowed creation of “topics,” and not just assignment of documents to “clusters,”
  • “Umbrella papers” – Word2Vec and Doc2Vec: Finding a new, latent-variable-enabled means for text vectorization, allowing an advance beyond TF*IDF, and providing a powerful method useful on large (“Google-scale”) corpora, and
  • Breakthrough #2 – the “attention mechanism” empowering transformers: Using a new form of “attention” to enable predicting a “next word” in a sequence of generated words, and thus providing a basis for all the LLMs.

First Breakthrough: From k-Means to Latent Dirichlet Allocation

The first breakthrough came from David Blei, Andrew Ng, and Michael Jordan, when they introduced Latent Dirichlet Allocation (LDA) in 2003.

Figure 6. When Blei, Ng, and Jordan invented the Latent Dirichlet Allocation (LDA) method in 2003, they made it possible to go beyond the simple k-means clustering algorithm. (See Resources and References for full reference list.)

Not surprisingly, the LDA algorithm involves latent variables. As the authors state, “By marginalizing over the hidden topic variable z, however, we can understand LDA as a two-level model.”

(AJM’s note: The LDA is a three-level model; we obtain the two-level model by marginalizing over the latent, or hidden, variables z, which are the topic assignments. The observable or “visible” elements are the words and the documents in the corpus.)
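
For reference, the document-level marginal likelihood from the Blei, Ng, and Jordan paper, in standard notation (theta is the per-document topic mixture drawn from a Dirichlet with parameter alpha, z_n is the latent topic assignment for the n-th word, and beta holds the topic-word probabilities), shows exactly where the latent variables sit:

```latex
p(\mathbf{w} \mid \alpha, \beta)
  \;=\; \int p(\theta \mid \alpha)
        \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right)
        d\theta
```

Summing over z_n inside the product is the “marginalizing over the hidden topic variable z” that the authors describe.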

The LDA introduced a key innovation: the opportunity to model a corpus as a set of topics, where each document could contribute to multiple topics. This was different from the hitherto dominant approach of modeling the elements of a corpus using k-means clustering.

Figure 7. The Latent Dirichlet Allocation (LDA) algorithm, invented by Blei, Ng, and Jordan in 2003 (see Resources and References), introduces topics as a set of latent variables. A single document can contribute terms to multiple topics, and a single topic can accept term contributions from multiple documents. The total set of topics functions as a set of latent variables.
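
A minimal scikit-learn sketch of this document-as-mixture-of-topics behavior, with a toy four-document corpus assumed (illustrative only): unlike k-means, each document receives a distribution over topics rather than a single cluster label.

```python
# Minimal sketch: each document gets a *mixture* of latent topics, not a single cluster.
# The toy four-document corpus and the choice of two topics are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "neural networks learn latent variables",
    "latent variables enable neural networks",
    "the market rallied and stocks rose",
    "stocks fell as the market slipped",
]
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts))  # one row per document: its distribution over the 2 topics
```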

In essence, the LDA breakthrough – involving the latent variables of the topics – was equivalent to Ackley, Hinton, and Sejnowski inventing the Boltzmann machine in 1985. The key element in making the breakthrough happen was the incorporation of latent variables.

This earlier Alianna J. Maren video gives a “contrast-and-compare” between clustering (k-means) and classification, with some discussion of LDA.

Maren, Alianna J. 2021. “NLP: Clustering vs. Classification.” Alianna J. Maren YouTube Channel. (Available at https://www.youtube.com/watch?v=PtkPBRMA6pY&list=PLUf2R_am1DRIlIGHxnKMC01UGxrQrWJUS&index=3.)

Significant “Umbrella” Paper: From TF*IDF to Word2Vec

A substantial advance in using neural networks for NLP came in 2013, when Mikolov and colleagues at Google invented Word2Vec (and, shortly thereafter, Doc2Vec).

Figure 8. Mikolov and colleagues at Google developed the Word2Vec and Doc2Vec algorithms (2013), building on a ten-year history of text-vectorization methods. Collectively, these methods advanced beyond the prior TF*IDF method to one more suitable for very large (Google-scale) text corpora.

The TF*IDF (term frequency * inverse document frequency) method had been dominant up until this time. It relied on frequency-counting of terms, both within a single document and for each term across the entire corpus of documents.
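
One common textbook form of the weighting (an illustrative version, not necessarily the exact variant any given system used): the weight of term t in document d combines the within-document term frequency tf(t, d) with the inverse document frequency across a corpus of N documents, where df(t) is the number of documents containing t.

```latex
\mathrm{tfidf}(t, d) \;=\; \mathrm{tf}(t, d) \,\times\, \log\frac{N}{\mathrm{df}(t)}
```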

Figure 9. The TF*IDF method creates a short vector representation of the terms in a corpus based on frequency-counting, both within-document and across the corpus.

There were many things that a person could do to increase the effectiveness of TF*IDF, particularly in controlled-vocabulary corpora. For example, removing stop words and creating “equivalence classes” (where various terms were set to be “equivalent” to a single reference term) improved results greatly.
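
A minimal scikit-learn sketch of TF*IDF vectorization with stop-word removal (the toy two-document corpus and scikit-learn’s built-in English stop-word list are illustrative assumptions):

```python
# Minimal sketch of TF*IDF vectorization with stop-word removal.
# The toy corpus and scikit-learn's built-in English stop-word list are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)      # sparse (n_docs x n_terms) matrix
print(vectorizer.get_feature_names_out())   # surviving terms after stop-word removal
print(tfidf.toarray())
```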

However, as the size and diversity of world-wide corpora increased, companies such as Google felt an increased pressure to develop a more globally-useful method.

The result was Word2Vec (and later Doc2Vec), developed by Mikolov and Google colleagues in 2013.

Figure 10. The Doc2Vec method (and similarly, Word2Vec) maps words from a document (or corpus) into a smaller vector representation. This vector is itself the set of hidden, or latent, variables. The strengths assigned to each element of this vector representation for each particular word in the corpus become the nonlinear, neural network-based latent variable representation for that word.
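
A minimal gensim sketch of Word2Vec training, with a toy corpus, a tiny vector size, and the skip-gram setting as illustrative assumptions (real use requires a large corpus). Each word’s learned vector is its latent-variable representation, and words used in similar contexts end up with similar vectors:

```python
# Minimal sketch of Word2Vec training.
# Toy corpus, tiny vector size, and the skip-gram setting are illustrative assumptions.
from gensim.models import Word2Vec

sentences = [
    ["neural", "networks", "learn", "latent", "variables"],
    ["latent", "variables", "enable", "neural", "networks"],
    ["stocks", "rose", "as", "the", "market", "rallied"],
]
model = Word2Vec(sentences=sentences, vector_size=16, window=2, min_count=1, sg=1)
print(model.wv["latent"])                 # the word's learned latent-variable vector
print(model.wv.most_similar("neural"))    # nearest words in the embedding space
```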

This earlier Alianna J. Maren video gives a “contrast-and-compare” between TF*IDF and Doc2Vec.

Maren, Alianna J. 2021. “NLP: Tf-Idf vs Doc2Vec – Contrast and Compare.” Alianna J. Maren YouTube Channel. (Available at https://www.youtube.com/watch?v=iSkbq6Tjkj0&list=PLUf2R_am1DRIlIGHxnKMC01UGxrQrWJUS&index=2&t=73s.)

Second NLP Breakthrough: Use of Transformers (BERT, GPTs, and More)

The second major breakthrough came about when we started using transformers, rather than a simple stochastic gradient descent-trained network, for text vectorization.

Transformers

Yes, transformers – invented by Vaswani et al. (2017) – were the great, “transformative” (oh, ouch! the pun!) breakthrough that gave NLP a HUGE surge forward.

Rather than present (yet another!) transformer tutorial, we invite you to check out a pair of excellent tutorials by Stefania Cristina. (See the Resources and References, at the end, under the Transformers section.) That section also includes other transformer tutorials – both technical blogposts and YouTube videos.

However, we briefly note that the most important feature of transformers is that they take the basic Ising equation (a fundamental and much-used statistical mechanics model) one powerful step further.

In brief, the Ising equation consists of two elements – an enthalpy term and an entropy term. In transformers, the enthalpy term is “souped up.” It is an order of magnitude (or more) more powerful than the simpler form used, for example, in all of the energy-based neural networks. (A sketch of both forms follows.)
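
To make that contrast concrete (and hedged: this is the statistical-mechanics framing used in the Mattias Bal posts cited below, written here in generic notation), the Ising-style free energy pairs an enthalpy (interaction) term with an entropy term, while the transformer’s scaled dot-product attention (Vaswani et al., 2017) replaces the simple, fixed pairwise couplings with learned, input-dependent query-key interactions.

```latex
% Ising-style free energy: enthalpy (interaction energy) minus temperature times entropy
F \;=\; \underbrace{\langle E \rangle}_{\text{enthalpy}} \;-\; T\,\underbrace{S}_{\text{entropy}},
\qquad
E \;=\; -\sum_{i<j} J_{ij}\, s_i s_j \;-\; \sum_i h_i s_i

% Scaled dot-product attention (Vaswani et al., 2017): the "souped-up" interaction term
\mathrm{Attention}(Q, K, V) \;=\; \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```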

In short, transformers are still based on statistical mechanics – just a more powerful version thereof.

While we are not, in this section, really recommending one transformer tutorial more than another (we offer a decent smorgasbord in the Resources and References), we do like the pair of tutorials offered by Mattias Bal, where he describes the statistical mechanics foundations of transformers. (And again, the references are in the Resources and References section, under Transformers.)


BERT: Bidirectional Encoder Representations from Transformers

BERT (Bidirectional Encoder Representations from Transformers) was the first major NLP outgrowth that used transformer methods (Devlin et al., 2018). Devlin et al. were – like Mikolov and colleagues – also at Google.

The great value of BERT is that it is “bidirectional” – that is, it uses words that occur in the sequence AFTER the word that is currently being modeled (as well as those before it), which gives it greater context and sensitivity to keywords. This ability has also made BERT a preferred mechanism for understanding sentiment, as sentiment expressions often involve negations of terms, e.g., “not happy.”
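
A minimal Hugging Face sketch of that bidirectionality (the transformers library and the public bert-base-uncased checkpoint are assumptions, and the exact candidates returned will vary): because BERT sees the words on both sides of the mask, the negation “not” changes which fill-ins it ranks highly.

```python
# Minimal sketch of BERT's bidirectional masked-word prediction.
# Assumes the Hugging Face `transformers` library and the public bert-base-uncased model.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# The words on BOTH sides of [MASK] -- including the negation "not" -- shape the prediction.
for candidate in unmasker("The movie was not [MASK] at all.")[:3]:
    print(candidate["token_str"], round(candidate["score"], 3))
```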

The key innovation – inherent in its name – was the use of transformers in the neural network architecture.

There have been numerous BERT specializations since its original invention in 2018.

The Resources and References section gives a number of good BERT tutorials, as well as the original paper.


The GPT-Series

There have been so many blogs, tutorials, YouTubes, and whatevers about the GPT-series of NLP engines that we find the whole topic to be cumbersome.

Instead, we suggest that the interested reader select from among several “well-curated” offerings in the Resources and References section, under GPT-Series.


ChatGPT

AJM’s Note: We frankly find this whole ChatGPT obsession rather exhausting. (And that includes the majestic, plural, and editorial forms of “we.”) Bluntly put – for all the shiny-newness of it all, it’s built on a transformer, which is built around a very souped-up Ising model (statistical mechanics), and that, in turn, derives all the way back from the Little-Hopfield neural network. Meaning, it is a content-addressable memory device.

ChatGPT, as with all GPT’s and all other LLMs, generates its response a word at a time, based on the prior language construct (including the input). This is so NOT AGI.

Nor is it a path to AGI.
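
To make the word-at-a-time point concrete, here is a conceptual numpy sketch of the generation loop (the toy vocabulary and the stand-in scoring function are illustrative assumptions; in a real GPT, the scores come from a large transformer): at each step, the model scores every candidate next token given everything generated so far, converts the scores to probabilities with a softmax, samples one token, appends it, and repeats.

```python
# Conceptual sketch of word-at-a-time (autoregressive) generation.
# The toy vocabulary and the stand-in `score_next` function are illustrative assumptions;
# in a real GPT, the scores come from a large transformer, not from this placeholder.
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "<eos>"]
rng = np.random.default_rng(0)

def score_next(context):
    # Placeholder for the transformer: one (unnormalized) score per vocabulary item.
    return rng.normal(size=len(vocab))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

tokens = ["the"]                       # the prompt
while tokens[-1] != "<eos>" and len(tokens) < 10:
    probs = softmax(score_next(tokens))                     # distribution over next token
    tokens.append(vocab[rng.choice(len(vocab), p=probs)])   # sample, append, repeat
print(" ".join(tokens))
```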

Yes, there are now a huge number of rules in place – governing the GPTs and preventing them from doing (deliberate) social harm. There are some strapped-on common-sense reasoning abilities.

Nevertheless, these methods do not integrate a full world ontology. Thus, they have no idea of what they’re talking about. Literally.

That makes them boring as hell.

We are much more interested in studying Hafner et al.’s 2022 paper on “Action Perception Divergence.” (See the full reference in Resources and References, under AGI.)

We are completely ignoring anything that relates to ChatGPT, including the plethora of “cheatsheets.”

At the same time – we DO think that listening to Sam Altman (CEO of OpenAI, which has been developing all the GPT-series) is interesting and worth our time. See an embedded YouTube at the end of this post.


The Remaining Three Neural Networks Classes

The primary intent of this post has been to discuss how latent variables enabled breakthroughs in the class (c) “topologically-organized” (vs. the original “topographically-organized”) neural networks.

Rather than take a separate blogpost (and vid) for the remaining three classes in the original logical topology, we briefly address the three other network classes here:

  • Class (d): Bidirectional resonating networks: the adaptive resonance theory, or ART, networks,
  • Class (e): Multilayer cooperative/competitive networks: the Neocognitron, and its descendants – the various convolutional neural networks (CNNs), and
  • Class (f): Hybrid networks: although there were early (and now largely forgotten) hybrids, the hybrids that are important now are the “deep” neural networks. These are the deep learning systems, in which layers of (restricted) Boltzmann machines use backpropagation to augment their training, and generative adversarial networks (GANs), in which a generative network and a discriminative network “compete” with each other.

Class (d): Adaptive Resonance Theory

The class (d) neural networks were originally identified as “bidirectional resonating” neural networks. However, the only neural network that has been significant in this class was the adaptive resonance theory, or ART, neural network, which was devised by Gail Carpenter and Steve Grossberg (1987a & b).

Figure 11. Steve Grossberg’s book, Conscious Mind, Resonant Brain, captures the insights that formed a cohesive set of problem statements leading to the adaptive resonance theory (ART) neural networks, developed together with colleague Gail Carpenter.

While ART neural networks have not found the widespread use and application enjoyed by the class (a) MLPs and the class (b) energy-based neural networks, they come from a set of very insightful questions asked by Grossberg, Carpenter, and their colleagues over several decades. We can be inspired by the “problem statements” advocated by Grossberg (2020) as we frame the next generation of neural networks.

Steve Grossberg has put together a great YouTube in which he explains the limitations of class (a) and (b) neural networks, and introduces the strong “problem statement” behind adaptive resonance theory networks.

YouTube: Steve Grossberg discusses “explainable and reliable AI,” including problems with deep learning methods and the unique perspective offered in adaptive resonance theory.
Grossberg, Stephen. 2022. “Explainable and Reliable AI: Comparing Deep Learning with Adaptive Resonance – Stephen Grossberg.” SAI Conference presentation. (Accessed May 25, 2023, available online at https://www.youtube.com/watch?v=RmbtXGp1avk.)

For a quicker take on the same subject, the I2 Team at the University of Washington has put together this interview with Dr. Grossberg.

Interview by the I2 Team (University of Washington) with Dr. Stephen Grossberg. Interactive Intelligence. 2022. “Dr. Grossberg’s Critical Take on Deep Learning | I2” (Interview with Steve Grossberg). Interactive Intelligence YouTube Channel. (Accessed May 25, 2023, available online at https://www.youtube.com/watch?v=5LgN440VThw.)

Class (e): Multilayer Cooperative/Competitive Neural Networks (Leading to Convolutional Neural Networks, or CNNs)

This class of neural networks was initiated by Kunihiko Fukushima, who invented the Neocognitron (1980). This neural network was inspired by research into the visual cortex, and contained layers with two different kinds of “cells,” which could receive (respectively) “excitatory” or “inhibitory” stimulus. Although there was a training rule, it was not as powerful as was desired.

We can envision the various “cells” in the Neocognitron layers as being early latent variable instances, which could not achieve their full strength because a stronger training mechanism was needed.

The breakthrough came in 1989, when Yann LeCun and colleagues invented the convolutional neural network (CNN), trained with backpropagation (LeCun et al., 1989). Addition of the more powerful backpropagation training enabled the initial CNN, and multiple CNN generations after that, to solve successively more complex and demanding image recognition/classification tasks.
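
A minimal PyTorch sketch of the convolution-pool-classify pattern that backpropagation made trainable (the layer sizes and the 28x28 single-channel input are illustrative assumptions, not a reproduction of the LeCun et al. 1989 architecture):

```python
# Minimal sketch of the convolution -> pool -> classify pattern, trained by backpropagation.
# Layer sizes and the 28x28 single-channel input are illustrative assumptions,
# not a reproduction of the original LeCun et al. (1989) architecture.
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # learned convolutional feature detectors
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),                   # 10-way classification head
)

x = torch.randn(4, 1, 28, 28)                    # a dummy batch of 4 images
print(cnn(x).shape)                              # torch.Size([4, 10])
# Training would proceed as usual: cross-entropy loss, loss.backward(), optimizer.step().
```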

Since then, many increasingly powerful CNNs have evolved. (See the Resources and References section for multiple links.)

One important factor crucial to CNN evolution was the ImageNet data set.

Fei-Fei Li, inspired by Christiane Fellbaum’s work on the WordNet lexical database, proposed creation of the ImageNet database. Because this (very large) image data set provided a means for benchmarking image classification performance, it became possible to assess improvements in CNN architectures.


Class (f): Hybrid Neural Networks: Deep Networks and Generative Adversarial Networks (GANs)

There have been two primary hybrid systems that have evolved over the past few decades:

  • Deep learning – involving multiple layers of (restricted) Boltzmann machines, with training fine-tuned by backpropagation (or some other stochastic gradient descent), and
  • Generative Adversarial Networks (GANs) – where a generative neural network “competes” with a discriminative neural network, each fine-tuning the other’s capabilities (the original GAN objective is sketched below).

Figure 13. The two most popular hybrid methods are deep learning and generative adversarial networks, or GANs.
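
For reference, the minimax objective from the original GAN paper (Goodfellow et al., 2014, cited below), in which the discriminator D and the generator G are trained against each other; here p_data is the data distribution and p_z is the generator’s input-noise distribution:

```latex
\min_{G}\,\max_{D}\; V(D, G)
  \;=\; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  \;+\; \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```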

References in the Resources and References section.


Artificial General Intelligence: The Progenitor

Artificial general intelligence (AGI) will most likely emerge from variational inference methods. In particular, it will most likely emerge from active inference, a methodology introduced and espoused by Karl Friston and colleagues over the past decade.

Figure 14. Variational inference is the most likely avenue to artificial general intelligence, or AGI. There is a connection between the latent Dirichlet allocation (LDA) method and variational inference: David Blei, the lead author of the LDA invention, was also the lead author of a significant tutorial on variational inference. See the LDA section in Resources and References for the Blei et al. paper on LDA, and the AGI section (near the end) for the Blei et al. variational inference review.

We did a baker’s dozen of blogposts on the Kullback-Leibler divergence, free energy, and variational methods – with links to crucial resources in that final (bonus) blogpost. (Full citation in the final Resources and References section, last sub-section.)

And most recently – and most excitingly! – work by Danijar Hafner, together with Karl Friston and other colleagues, on “Action Perception Divergence” seems to offer an even more likely AGI vector.

We’ll discuss this paper in future Themesis blogposts and YouTube vids.


“Live free or die,” my friend!*

* “Live free or die” – attrib. to U.S. Revolutionary War General John Stark. https://en.wikipedia.org/wiki/Live_Free_or_Die

Alianna J. Maren, Ph.D.

Founder and Chief Scientist

Themesis, Inc.



Resources and References

There are three primary components to the Resources and References:

  • Prior related YouTubes,
  • Prior related blogposts, and
  • The cited literature.

Prior Related YouTubes

These two YouTubes are listed in suggested watching order:

Maren, Alianna J. 2023. “A New Neural Network Class: Creating the Framework.” Themesis, Inc. YouTube Channel (May 18, 2023). (Accessed May 22, 2023; available at https://www.youtube.com/watch?v=KHuUb627POs.)
Maren, Alianna J. 2023. “Kuhnian Normal and Breakthrough Moments (Revised and Updated).” Themesis, Inc. YouTube Channel (May 16, 2023). (Accessed May 22, 2023; available at https://www.youtube.com/watch?v=_pg28LJ56ME&t=375s.)

Each of these YouTubes has an associated blogpost; see below.


Prior Related Blogposts (with References)

Predecessor Blogposts in This Series

  • Maren, Alianna J. 2023. “Kuhnian Normal and Breakthrough Moments: The Future of AI (Part 1 of 3).” Themesis Blogpost Series (Feb. 1, 2023). (Accessed May 22, 2023; available online at: https://themesis.com/2023/02/01/kuhnian-normal-and-breakthrough-moments/.) (AJM’s Note: This blogpost goes with the FIRST YouTube in this series, “Kuhnian Normal and Breakthrough Moments (Revised and Updated).”)
  • Maren, Alianna J. 2023. “New Neural Network Class: Framework: The Future of AI (Part 2 of 3).” Themesis Blogpost Series (May 16, 2023). (Accessed May 22, 2023; available online at: https://themesis.com/2023/05/16/new-neural-network-class/.) (AJM’s Note: This blogpost goes with the MOST RECENT YouTube in this series, “A New Neural Network Class: Creating the Framework.”)

The Cited Literature

The references to literature cited in this blogpost are organized into groups corresponding to the sections of this blogpost.

The Starting Point: The Logical Topology of Neural Networks

AJM’s Note: The original “logical topology” paper:


Class (c): Topologically-Organized Sets of Vectors and k-Means Clustering

TF*IDF (Term Frequency * Inverse Document Frequency)

AJM’s Note: The person who invented the TF*IDF algorithm was Karen Spärck Jones, whose contributions have not been given the attention that is due.

  • Bowles, Nellie. 2019. “Overlooked No More: Karen Spärck Jones, Who Established the Basis for Search Engines.” The New York Times (Jan. 2, 2019).

AJM’s Note: Margaret Masterman, founder of the Cambridge Language Research Unit and mentor to Karen Spärck Jones (who invented the TF*IDF algorithm), was unappreciated in her own life, yet left a huge legacy to the emerging field of natural language processing.

K-Means Clustering and the Least Vector Quantization Algorithm

  • Grossberg, S. 1975. “On the Development of Feature Detectors in the Visual Cortex with Applications to Learning and Reaction-Diffusion Systems.” Biological Cybernetics 21: 145-159. (Accessed May 28, 2023; available online at https://sites.bu.edu/steveg/files/2016/06/Gro1975BiolCyb.pdf.) (This paper is where Grossberg outlined the proof. See p. 153, where it cited Grossberg (1976; it was in preparation at that time) as the source of a more detailed exposition.)
  • Grossberg, S. 1976. “Adaptive Pattern Classification and Universal Recoding, I: Parallel Development and Coding of Neural Feature Detectors.” Biological Cybernetics 23: 121-134. (Accessed May 28, 2023; available online at https://sites.bu.edu/steveg/files/2016/06/Gro1976BiolCyb_I.pdf.)
  • Kohonen, Teuvo. 1982. “Self-Organized Formation of Topologically Correct Feature Maps.” Biol. Cybern. 43: 59-69. (Accessed April 26, 2023; available online at https://tcosmo.github.io/assets/soms/doc/kohonen1982.pdf.)
  • Kohonen, Teuvo. 1995. “Learning Vector Quantization.” In M.A. Arbib (ed.), The Handbook of Brain Theory and Neural Networks. Cambridge, MA: MIT Press, 537–540.
  • Kohonen, Teuvo. 1997. Self-Organizing Maps. Berlin: Springer.

The NLP Breakthrough Work

Super Evolution-of-Algorithms Tutorials

AJM’s Note: References still being added to this section.


Latent Dirichlet Allocation (LDA)

AJM’s Note: References still being added to this section.


Word2Vec and Doc2Vec

This section includes not only the “umbrella” 2013 paper published by Mikolov et al., but a fine tutorial (Zafar Ali, 2019) and also some key works cited by Mikolov et al. (Note that the lead author for the 2014 Doc2Vec method, following the original 2013 Word2Vec method, was Q. Le with Mikolov as the second author.)

  • Ali, Zafar. 2019. “A Simple Word2Vec Tutorial.” Medium.com. (Accessed May 2, 2023; available online at A Simple Word2Vec Tutorial.)
  • Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. “A Neural Probabilistic Language Model.” Journal of Machine Learning Research 3:1137-1155. (Accessed May 24, 2023; available online at https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf.)
  • Hinton, Geoffrey E. 2000. “Training Products of Experts by Minimizing Contrastive Divergence.” Technical Report GCNU TR 2000-004, Gatsby Unit, University College London. (Accessed May 24, 2023; available online at https://www.cs.toronto.edu/~hinton/absps/tr00-004.pdf)
  • Le, Quoc, and Tomáš Mikolov. 2014. “Distributed Representations of Sentences and Documents.” Proceedings of the 31st International Conference on Machine Learning, PMLR 32 (2):1188-1196.
  • Mikolov, Tomáš, Stefan Kombrink, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2011. “Extensions of Recurrent Neural Network Language Model.” Proceedings of ICASSP 2011. (Accessed May 24, 2023; available online at https://ieeexplore.ieee.org/document/5947611.)
  • Mikolov, Tomáš, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” arXiv:1301.3781 [cs.CL]. (Accessed May 2, 2023; available online at arXiv:1301.3781.)

Transformers

AJM’s Note: This is the key article that kicked off the transformer-based evolution of NLP methods. How BERT and the GPT series use transformers is entirely different, but this is the crucial predecessor for both.

  • Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” arXiv:1706.03762 [cs.CL]. (Accessed May 2, 2023; available online at arXiv:1706.03762 [cs.CL].)

AJM’s Note: This is an excellent YouTube discussion of transformers; Lex Fridman interviews Andrej Karpathy. This one is worth watching and re-watching; several times.

  • Fridman, Lex. 2022. “Transformers: The best idea in AI | Andrej Karpathy and Lex Fridman.” Lex Clips (Extracts from the Lex Fridman YouTube podcast series; about 8 1/2 minutes for this one). (Accessed May 28, 2023; available online at https://www.youtube.com/watch?v=9uw3F6rndnA.)
Lex Fridman interviews Andrej Karpathy on transformers.

AJM’s Note: Transformer tutorials – the original Vaswani et al. paper is not that hard to read, but sometimes, a more detailed explanation is useful. These two, by Stefania Cristina, first show how transformers are used in text vectorization, and then dive into transformer architectures.

  • Cristina, Stefania. 2022. “The Transformer Attention Mechanism.” Blogpost on MachineLearningMastery.com. (Sept. 15, 2022.) (Accessed May 2, 2023; available online at https://machinelearningmastery.com/the-transformer-attention-mechanism/.)
  • Cristina, Stefania. 2023. “The Transformer Model.” Blogpost on MachineLearningMastery.com. (January 6, 2023.) (Accessed May 2, 2023; available online at The Transformer Model.)
  • Dr. Cristina has a number of other transformer-related blogposts, including some that take you through model development; a link to her full set of MachineLearningMastery.com posts is HERE.

AJM’s Note: Transformers. The hottest thing in AI/ML. And yet ANOTHER thing that totally rests on statistical mechanics. I found this pair of blogposts by Mattias Bal on transformers when doing a prior blogpost series on the K-L divergence, free energy, variational Bayes. This pair of posts also supports our study of transformers for NLP methods, such as BERT and the GPT series.


BERT: Bidirectional Encoder Representations from Transformers

AJM’s note: This is the original BERT paper, by Devlin et al., 2018.

  • Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv (Accessed April 26, 2023; available online at https://arxiv.org/abs/1810.04805.)

AJM’s note: I particularly like this pair of articles by Ranjeet K. Gupta. They are comprehensive, stretching across numerous algorithms and breakthroughs. They are also well-illustrated, broken down to simple concepts, and very easy to follow!

Collage of various NLP concepts and architectures that led the way to the modern transformer-based NLP model, BERT.

Ranjeet Gupta, “Journey to BERT: Part 2.”

AJM’s Note: One of the important things to understand is the distinction between the two NLP paths that evolved from transformers: BERT and the GPT series (including ChatGPT and its subsequent iterations). This is a nice contrast-and-compare tech blog addressing BERT and GPT-3.

  • Mottesi, Celeste. 2023. “GPT-3 vs. BERT: Comparing the Two Most Popular Language Models.” Invgate.com (February 9, 2023). (Accessed April 26, 2023; available online at https://blog.invgate.com/gpt-3-vs-bert.)

The GPT-Series of LLMs

AJM’s Note: Why not just go to the source? The OpenAI GPT-4 Tech Report, at 100 pages.

  • OpenAI. 2023. “GPT-4 Technical Report.” arXiv:2303.08774v3 [cs.CL] (Mar. 27, 2023). (Available online at arXiv:2303.08774v3 [cs.CL].)

AJM’s Note: Balance this with something more immediately-accessible, such as the Lex Fridman interview with Sam Altman. See the embedded YouTube link at the end of this blogpost.

Other LLMs, etc.

OK. If you must.

Here’s one cross-compare article.

Keep in mind that it will be obsolete a few weeks from now.


ChatGPT

AJM’s Note: References still being added to this section.

One of the detractors of ChatGPT, Minhaj Rehman, posts a scathing review:

Rehman cites the following study by researchers at Stanford and Columbia that disproves some earlier claims about ChatGPT.

  • Zhang, Tianyi, et al. 2023. “Benchmarking Large Language Models for News Summarization.” arXiv:2301.13848v1 [cs.CL] (Jan. 31, 2023). doi: 10.48550/arXiv.2301.13848. (Accessed May 17, 2023; available online at https://www.arxiv-vanity.com/papers/2301.13848/.)

Class (d): Adaptive Resonance Theory (ART) Neural Networks

AJM’s Note: This class is now solely devoted to adaptive resonance theory (ART) networks, invented by Gail Carpenter and Steve Grossberg. The Grossberg (1976) article on “Adaptive Pattern Classification …” (see reference in the “Class (c) Topologically-Organized …” section) was a precursor to the one below.

By this point, I had already shown in these articles that catastrophic forgetting was possible in models of competitive learning and self-organizing maps. I introduced Adaptive Resonance Theory in Part II of the (1975/76) article(s) to dynamically stabilize learning and memory using attentional focusing by learned top-down expectations.

Stephen Grossberg, describing the evolution of his work. Personal communication.

AJM’s Note: The following article contains a large number of ART applications.

  • Carpenter, Gail A., and Stephen Grossberg. 1987a. “A Massively Parallel Architecture for a Self-Organizing Neural Pattern Recognition Machine.” Computer Vision, Graphics, and Image Processing, 37: 54–115. (Accessed May 22, 2023; available online as http://techlab.bu.edu/files/resources/articles_cns/CarpenterGrossberg1987.pdf.)
  • Carpenter, G.A. and Grossberg, S. 1987b. “ART 2: Stable Self-Organization of Pattern Recognition Codes for Analog Input Patterns.” Applied Optics, 26: 4919-4930. doi:10.1364/AO.26.004919. (Accessed May 22, 2023; available online as https://opg.optica.org/ao/fulltext.cfm?uri=ao-26-23-4919&id=30891.)
  • Grossberg, Stephen. 2020. Conscious Mind, Resonant Brain: How Each Brain Makes a Mind, 1st Edition. Oxford, UK: Oxford University Press.
  • Grossberg, Stephen. 2022. “Explainable and Reliable AI: Comparing Deep Learning with Adaptive Resonance – Stephen Grossberg.” YouTube video based on Grossberg’s SAI Conference presentation. (Accessed May 25, 2023, available online at https://www.youtube.com/watch?v=RmbtXGp1avk.)
  • Interactive Intelligence. 2022. “Dr. Grossberg’s Critical Take on Deep Learning | I2” (Interview with Steve Grossberg). Interactive Intelligence YouTube Channel. (Accessed May 25, 2023, available online at https://www.youtube.com/watch?v=5LgN440VThw.)

The full suite of over 500 of Grossberg archival articles, including the above ones, can be downloaded from Grossberg’s Boston University web page sites.bu.edu/steveg, along with various videos of interviews, keynote lectures, etc.


Class (e): Multilayer Cooperative/Competitive Neural Networks (Leading to Convolutional Neural Networks, or CNNs)

AJM’s Note: References still being added to this section.

AJM’s Note: This is the original Fukushima paper, in which he proposed a multi-layered system designed to recognize handwritten Japanese kanji characters.

  • Fukushima, Kunihiko. 1980. “Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position.” Biol. Cybernetics 36: 193–202. (Accessed May 22, 2023; available online as https://www.rctn.org/bruno/public/papers/Fukushima1980.pdf.)

AJM’s Note: This is the original Yann LeCun et al. paper, presenting the first known instance of a convolutional neural network.

AJM’s Note: A follow-on work by LeCun and Bengio beautifully summarizes CNN development.

AJM’s Note: Andrej Karpathy has a good and useful blogpost, in which he reproduced the original 1989 LeCun et al. findings.

AJM’s Note: This is a key CNN breakthrough paper, presenting the AlexNet.

  • Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2017. “ImageNet Classification with Deep Convolutional Neural Networks.” Communications of the ACM 60 (6) (May 24, 2017): 84-90. doi:10.1145/3065386. (Accessed May 28, 2023; available online at https://dl.acm.org/doi/10.1145/3065386.)

AJM’s Note: Very good interview where Ilya Sutskever explains how he developed the first major convolutional neural network – the AlexNet, in concert with Alex Krizhevsky and Geoffrey Hinton.

  • Fridman, Lex with Ilya Sutskever. 2020. “Ilya Sutskever: Deep Learning | Lex Fridman Podcast #94.” Lex Fridman YouTube podcast series. (Accessed May 28, 2023; available online at https://www.youtube.com/watch?v=13CZPWmke6A.)

AJM’s Note: Deep Boltzmann machines (DBMs) are also useful for some forms of image recognition. This is an example of using a DBM for analyzing a class of medical images; the authors contrast and compare it with similar work using a CNN.

  • Jeyaraj, Pandia Rajan, and Edward Rajan Samuel Nadar. 2019. “Deep Boltzmann Machine Algorithm for Accurate Medical Image Analysis for Classification of Cancerous Region.” Cognitive Computation and Systems 1 (3): 85-90 (24 September 2019). doi:10.1049/ccs.2019.0004. (Accessed March 28, 2023; available online at https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/ccs.2019.0004.)

AJM’s Note: The following are two very nice overviews of CNN evolution over the past several years. The second is comparable to the first, with a slightly different selection of CNNs. Both are good reads, very well-illustrated.

The ImageNet database, created and nurtured by Fei-Fei Li.


Class (f): Hybrid Neural Networks

There are two primary classes of hybrids: deep learning methods that contain multiple layers of (restricted) Boltzmann machines, with various inclusions of discriminative layers (e.g., for final classification), and generative adversarial networks, or GANs.

For the sake of simplicity, we feature only a few articles on each.

Deep Learning

AJM’s Note: This is an excellent deep learning review article by AI leaders Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. It is extremely readable – at the tutorial level, almost. It covers convolutional neural networks (CNNs) as well as deep learning architectures for a variety of tasks. The discussion of various issues, such as learning representations in neural networks, is excellent. This is an important “must-read” paper in anyone’s AI education.

AJM’s Note: These two papers, by Salakhutdinov and Hinton, established the deep learning architectures.

  • Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. 2006. “Reducing the Dimensionality of Data with Neural Networks.” Science 313 (5786) (July 28, 2006): 504-507. doi:10.1126/science.1127647. (Accessed April 5, 2022; available online at https://www.cs.toronto.edu/~hinton/science.pdf.)
  • Salakhutdinov, Ruslan, and Geoffrey Hinton. 2012. “An Efficient Learning Procedure for Deep Boltzmann Machines.” Neural Computation 24 (8) (August, 2012): 1967–2006. doi:10.1162/NECO_a_00311. (Accessed April 3, 2022; available online at https://www.cs.cmu.edu/~rsalakhu/papers/neco_DBM.pdf.)

AJM’s Note: What made the whole realm of deep learning possible was Hinton’s 2002 invention of the contrastive divergence training algorithm for a (restricted) Boltzmann machine.

For a bit of discussion on these three papers, including how entropy is treated in the Boltzmann machine, check out this 2022 Themesis blogpost on “Entropy in Energy-Based Neural Networks – Seven Key Papers (Part 3 of 3)” (https://themesis.com/2022/04/04/entropy-in-energy-based-neural-networks-seven-key-papers-part-3-of-3/).


Generative Adversarial Networks

AJM’s Note: This is the invention of GANs, spearheaded by Goodfellow and his colleagues at the Université de Montréal.

  • Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Networks.” arXiv:1406.2661v1 [stat.ML] (10 Jun 2014). (Accessed May 27, 2023; available online at https://arxiv.org/pdf/1406.2661.pdf.)

Artificial Narrow Intelligence: Reinforcement Learning

AJM’s Note: Up until the moment that we found THIS article, we would have said that AGI was the purview of methods rooted in variational inference (see references below), and that reinforcement learning was limited to being an ANI (artificial narrow intelligence).

This article, by Hafner et al., changes things a bit.

From their Abstract:

Applied out of the box, DreamerV3 is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula, a long-standing challenge in artificial intelligence. Our general algorithm makes reinforcement learning broadly applicable and allows scaling to hard decision making problems.

Hafner et al. 2023. “Mastering Diverse Domains through World Models.” (See full citation below.)
  • Hafner, Danijar, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. 2023. “Mastering Diverse Domains through World Models.” arXiv:2301.04104v1 [cs.AI] (10 Jan 2023). (Accessed May 27, 2023; available online at https://arxiv.org/pdf/2301.04104.pdf.)

AJM’s Note: A lot of the ANI/AGI discussion will involve comparisons between reinforcement learning and inference-based methods. This paper, by Sajid et al., is an excellent discussion and cross-compare of the two methods.

Note that two of the co-authors of this paper, Thomas Parr and Karl Friston, are also co-authors with Danijar Hafner and others on the “Action and Perception” paper that we cite in the following subsection (AGI references).

  • Sajid, Noor, Philip J. Ball, Thomas Parr, and Karl J. Friston. 2020. “Active Inference: Demystified and Compared.” arXiv:1909.10863v3 [cs.AI] (30 Oct 2020). (Accessed June 17, 2022; available online at https://arxiv.org/abs/1909.10863.)

Artificial General Intelligence (Variational Inference, Active Inference, and Action Perception Divergence (APD))

AJM’s Note: This is the Blei et al. paper on variational inference that I mentioned in the accompanying YouTube vid. It’s a useful and important tutorial.

  • Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. 2018. “Variational Inference: A Review for Statisticians.” arXiv:1601.00670v9 [stat.CO] (9 May 2018). (Accessed May 25, 2023; available online at https://arxiv.org/pdf/1601.00670.pdf.)

AJM’s Note: This work by Hafner and colleagues extends the notions of active inference into “Action Perception Divergence (APD),” which is very possibly the most significant vector into real AGI.

  • Hafner, Danijar, Pedro A. Ortega, Jimmy Ba, Thomas Parr, Karl Friston, and Nicolas Heess. 2022. “Action and Perception as Divergence Minimization.” arXiv:2009.01791v3 [cs.AI] (13 Feb 2022). (Accessed May 25, 2023; available online at https://arxiv.org/pdf/2009.01791.pdf.)

AJM’s Note: The following blogpost contains ALL the resources and references given over the previous twelve blogposts on “Kullback-Leibler Divergence, Free Energy Minimization, and Variational Inference.” It’s the best compendium around – organized by ease-of-reading/understanding, from “Bunny Trails” (rank beginners) through “Blue Squares” (intermediate) up to “Black Diamond” (super-expert-level).

AJM’s Note: It is VERY difficult to learn variational inference on one’s own. The real challenge is not so much the equations themselves, it is cross-correlating the notation between two or more important papers.

When I was teaching myself the fundamentals, I referred primarily to Matthew Beal’s dissertation on variational inference, because that was the work referenced by Karl Friston. (See all the references in the above “Resource Compendium” blogpost.)

This arXiv paper is one that I wrote, mostly for myself, to do the notational cross-correspondences. I called it a “Rosetta stone” tutorial, because it was a cross-referencing of notation between three different authors, much as the original Rosetta stone presented the same text in three different languages.

  • Maren, Alianna J. 2022. “Derivation of the Variational Bayes Equations.” Themesis Technical Report TR-2019-01v5 (ajm). arXiv:1906.08804v5 [cs.NE] (4 Nov 2022). (Accessed Nov. 17, 2022; pdf.)

This arXiv paper (Themesis Technical Report) is the source for the famous two-way deconstruction of the Kullback-Leibler divergence for a system with a Markov blanket, which is central to Friston’s work.

Figure 15. (Taken from a prior Themesis blogpost; the above-mentioned Resource Compendium.) Friston’s Eqn. 2.7 from “Life as We Know It” models a system which contains both external states (Psi) and internal states (r), separated by a Markov boundary (s and a). The Kullback-Leibler divergence presented at the top can be deconstructed in two different ways. Each of the results is a valid equation; one is more computationally useful than the other. For simplicity, the modeling parameter m is not shown in the model probability distribution p(Psi, s, a, r). Figure is extracted from Maren (2022), and was discussed in the previous blogpost (Part 5), and introduced in the blogpost prior to that one (Part 4).
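
For readers who want the two deconstructions in generic (non-Fristonian) notation, here is a standard sketch of the variational free energy for a recognition density q(z) and a model p(x, z); this is a textbook-style restatement under those notational assumptions, not a reproduction of Friston’s Eqn. 2.7:

```latex
F \;=\; \mathbb{E}_{q(z)}\!\big[\log q(z) - \log p(x, z)\big]
  \;=\; D_{\mathrm{KL}}\!\big(q(z)\,\|\,p(z \mid x)\big) \;-\; \log p(x)
  \;=\; D_{\mathrm{KL}}\!\big(q(z)\,\|\,p(z)\big) \;-\; \mathbb{E}_{q(z)}\!\big[\log p(x \mid z)\big]
```

Both deconstructions are valid; as noted in the figure caption, one is more computationally useful than the other.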

AJM’s Note: Sam Altman on AGI. I really like this podcast. Fridman assembled a set of very good questions. Altman is thoughtful and insightful. This was one of my favorite “listen-to” YouTube vids, played in background mode as I prepared this blogpost.



Neuroscience News

This is breaking research by Aryn H. Gittis of Carnegie Mellon.

We are highlighting this work because the new neural network class that we will be introducing soon is best seen as a parallel or indirect pathway, such as the kind referenced here in the discussion of basal ganglia neurons.

“Our results suggest that negative reinforcement, not motor suppression, is the primary behavioral correlate of signaling along the indirect pathway, and that the motor suppressing effects of A2A-SPN stimulation are driven through inhibitory collaterals within the striatum.”

Isett, Brian R. 2023. See full citation below.

Here’s a layman’s overview:

Here’s the original research article:

  • Isett, Brian R., Katrina P. Nguyen, Jenna C. Schwenk, Jeff R. Yurek, Christen N. Snyder, Maxime V. Vounnatsos, Kendra A. Adegbesan, Ugne Ziausyte, and Aryn H. Gittis. 2023. “The indirect pathway of the basal ganglia promotes transient punishment but not motor suppression.” Neuron (May 16, 2023) S0896-6273(23)00302-1. doi:10.1016/j.neuron.2023.04.017. (Accessed May 17, 2023; available online at https://www.biorxiv.org/content/10.1101/2022.05.18.492478v1.full.)


Wrapping Things Up: What to Read, Watch, or Listen-to Now

AJM’s Note: During the course of preparing both this blogpost and the accompanying YouTube, I’ve listened to a (growing) number of YouTube interviews and presentations on AI, along with the full Senate Judiciary Hearing where Sam Altman, CEO of OpenAI, testified along with Christina Montgomery (Vice President & Chief Privacy & Trust Officer, IBM) and Gary Marcus (Professor Emeritus, New York University). (For the Themesis compilation of resources surrounding that Senate Hearing, see: https://themesis.com/2023/05/18/conversation-sam-altman-openai-ceo-testifies-before-senate-judiciary-committee/.)

Of all of these, some of the YouTube podcasts that I’ve enjoyed the most have been Lex Fridman’s interviews with various AI notables. These interviews – while having some elements where both parties just riff on a theme – benefit from Fridman’s careful and exhaustive preparations, leading to a detailed set of thought- (and conversation-) provoking questions.

This Fridman interview with Sam Altman is one of my favorites. Altman is remarkably comprehensive and nuanced as he addresses important issues regarding the future of AI.

  • Fridman, Lex, with Sam Altman. 2023. “Sam Altman: OpenAI CEO on GPT-4, ChatGPT, and the Future of AI | Lex Fridman Podcast #367.” Lex Fridman YouTube Podcast Series. (March, 2023). (Accessed May 17, 2023; available online at https://www.youtube.com/watch?v=L_Guz73e6fw&t=3336s.)

AJM’s Note: Personally, I’m putting my energy-investment egg into studying Hafner et al.’s paper on “Action Perception Divergence,” see the reference in “Artificial General Intelligence.” I think this is truly the best vector into AGI.

For many who wish to study that same paper, the depth of prior knowledge needed – in the Kullback-Leibler divergence, free energy, variational methods, and all things Fristonian (e.g., active inference) – can be more than daunting.

Themesis published a twelve-post series on those topics, and capped it off with a “Resource Compendium” (https://themesis.com/2022/11/16/variational-free-energy-resource-compendium/).

The best starting point, of course, would be the beginning of that series, “The Kullback-Leibler Divergence, …” (https://themesis.com/2022/06/28/the-kullback-leibler-divergence-free-energy-and-all-things-variational-part-1-of-3/.)

Bon appétit!

Alianna J. Maren, Ph.D.

Founder and Chief Scientist

Themesis, Inc.

May 30, 2023
