CORTECONS: A New Class of Neural Networks

In the classic science fiction novel, Do Androids Dream of Electric Sheep?, author Philip K. Dick gives us a futuristic plotline that would – even today – be more exciting and thought-provoking than many of the newly-released “AI/robot as monster” movies.

Figure 1. Cover image of first edition release of Do Androids Dream of Electric Sheep?, a 1968 novel by Philip K. Dick.

The key question today is: Can androids dream?

This is not as far-fetched as it might seem.

Many people have (mistakenly) thought that LLMs (large language models) were actually “thinking.”

Of course, LLMs can’t dream – and they can’t really think. But as Lex Fridman surmises, in one of his interviews with Sam Altman, CEO of OpenAI, “I think it [GPT-4] knows how to fake consciousness.”

Excerpt from a Lex Fridman interview with Sam Altman, CEO of OpenAI. The conversation about whether GPT-4 is conscious begins at about 50 seconds into the clip.

The fact that we can even ask this question (about AIs and consciousness, or dreaming, or even fluid thinking) – and have it be relevant to a specific AI system – is far different from freewheeling speculation about a potential universe filled with sentient (or “fake-sentient”) AIs.

The reason that LLMs can’t be “conscious,” or even dream, is inherent in their architecture.

LLMs are autoregressive engines, and they can assemble (one element at a time) strings of elements that “look like” conscious thought – drawing on a huge reservoir of stored sequential elements (such as the entirety of Wikipedia and many other sources).

HOWEVER … the burning question now is: What would it take for an AI to get even one step closer to real AGI?

We’re not talking true “consciousness” here.

We’re talking about a more expanded realm of capabilities than is offered by deep learning (DL), convolutional neural networks (CNNs), autoregressive engines (LLMs), variational autoencoders (VAEs), and even reinforcement learning (RL).

This brings us to a new class of neural networks: CORTECONs.

A New Neural Network Class

I am so excited to present an overview of CORTECONs (COntent-Retentive TEMporally-CONnected neural networks), a new class of neural network that allows for more interesting temporal behaviors than are possible with current neural network classes.

This YouTube presents the CORTECON architecture and equations, as well as how they can be used with variational inference to advance artificial general intelligence (AGI).

Maren, Alianna J. 2023. “CORTECONS: Architecture, Equations, Connection with AGI.” Themesis, Inc. YouTube Channel (Sept. 11, 2023). (Accessed Sept. 11, 2023; available online HERE.)

I began work on CORTECONs in 1992, and after a few conference presentations, had to “backburner” it until 2014, when CORTECON evolution became my life’s great passion once again.

One of the most important things that had to happen – to make CORTECONs effective – was to figure out how to identify the parameters (epsilon0, epsilon1) that would bring a model into closest alignment with a representation grid. The tricky thing was – these parameters had to correspond to a free energy minimum in the CORTECON (1-D or 2-D cluster variation method) phase space.

This required creating a new divergence method (Maren 2022).


CORTECONs

CORTECONs are a new class of neural network, with a new fundamental mode of behavior: free energy minimization across the latent layer.

Figure 2: CORTECONs (COntent-Retentive TEMporally-CONnected neural networks) are a new class of neural network.

CORTECONs use their laterally-connected neuron layer (hidden or “latent” nodes) to enable three new kinds of temporal behavior: 

  1. Memory persistence (“Holding that thought”) – neural clusters with variable slow activation degradation allow persistent activation after stimulus presentation,
  2. Learned temporal associations (“That reminds me …”) – neural clusters with slowly degrading activation can form associations with newly-activated node clusters, thus creating temporally-based associations, and
  3. Random activation foraging (“California Dreamin’ …”) – when only “noise” or partial stimulus elements are presented, the network is able to free-associate and move among various previously learned states, and potentially also create new, stabilized activation patterns.

In short, CORTECONs are designed to make associative processes, such as “dreaming,” possible. They will be able to uncover new associations between (sequences of) presented input patterns, and thus achieve new “insights,” as manifested in new activation patterns in the latent layer, and (as a result) new activations in the output layer.


Two Kinds of Dynamic Behavior

The defining feature of CORTECONs is that they allow for two kinds of dynamic behavior (sketched in code after the following list):

  • “Vertical” signal-to-output node activations, following the mechanisms established for Multilayer Perceptrons (MLPs) or Restricted Boltzmann Machines (RBMs), and also
  • “Lateral” free energy minimization within a (relatively large) set of laterally-organized latent nodes.
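To make the two behaviors concrete, here is a minimal, hypothetical sketch of one CORTECON update cycle: a standard feedforward (“vertical”) pass, followed by a “lateral” pass that proposes on/off node swaps in the latent layer and keeps any swap that lowers a free energy function. The function names, the toy free energy, and the greedy swap rule are illustrative assumptions, not the published CORTECON algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def vertical_pass(x, W, b):
    """MLP-style ("vertical") activation of the latent layer from the input layer."""
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))      # sigmoid activations

def toy_free_energy(grid):
    """Stand-in for the CVM free energy of the latent grid (an assumption;
    the real computation uses the cluster variation method)."""
    return -np.sum(grid[:-1] == grid[1:])          # reward like-with-like neighbors

def lateral_pass(grid, n_trials=200):
    """Propose swaps of one "on" and one "off" node; keep swaps that lower the free energy."""
    grid = grid.copy()
    for _ in range(n_trials):
        i, j = rng.choice(len(grid), size=2, replace=False)
        if grid[i] == grid[j]:
            continue                               # swapping identical nodes changes nothing
        trial = grid.copy()
        trial[i], trial[j] = trial[j], trial[i]
        if toy_free_energy(trial) < toy_free_energy(grid):
            grid = trial
    return grid

# One illustrative cycle on random data.
x = rng.random(8)                                  # input ("sensing") layer
W, b = rng.normal(size=(16, 8)), np.zeros(16)
latent = (vertical_pass(x, W, b) > 0.5).astype(int)   # binarize for the latent grid
latent = lateral_pass(latent)                      # lateral free energy minimization
```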

CORTECON Architecture

The key and distinguishing feature of CORTECONs is that the latent variable layer can itself be brought to a free energy minimum, using the cluster variation method (CVM). This latent layer is composed of a relatively large number of nodes whose locations are fixed in a specific grid, as illustrated in the following Figure 3 for a CORTECON with a 2-D latent layer.

Figure 3. An illustrative CORTECON architecture, where the latent layer consists of a 2-D grid whose nodes are arranged in a specific zigzag pattern. This grid can be brought to a free energy minimum by interchanging the locations of active nodes.

The notation used here shows the CORTECON inputs and outputs as elements of a Markov blanket, keeping consistent with the notation introduced by Karl Friston (2013; see also Friston et al. 2015 and Maren 2022). The input layer corresponds to “sensing units,” and the output layer corresponds to “active units.”


Free Energy Minimization in the Latent Layer

The basis for the new class of neural networks, or CORTECONs, is a latent variable layer where:

  • There are a (relatively speaking) large number of nodes, and
  • The activations of the nodes in this layer can be governed by BOTH inputs from an input (or visible) layer AND ALSO a free energy minimization process across this lateral layer.

We will say that free energy minimization across this lateral latent layer makes this a dynamic neural network, as compared with a generative neural network. (Yes, we know that the term “dynamic” has been co-opted for a form of learning over time, but we believe it is more appropriately used here. In a gesture of good will towards previous research, we’ll say that CORTECONs are “laterally dynamic” systems.)

What makes free energy minimization across the lateral layer possible is that we use an Ising equation with a somewhat more complex entropy term. This entropy formulation was first introduced by Kikuchi (1951) and further advanced by Kikuchi and Brush (1967). This method is called the cluster variation method (CVM).

The interesting thing about a CVM is that we can change the free energy of the system simply by swapping the activations of two nodes. This is very different from the way in which we’ve thought of free energies in prior neural network applications.

We illustrate node-swapping in the following Figure 4, for the case of a 1D (single zigzag chain) CVM grid.

Figure 4. Illustration of a 1-D (single zigzag chain) CVM (cluster variation method) grid. Bottom rows: before two nodes were swapped (the user chooses two nodes with different activations). Top rows: after the two nodes are swapped. Interactive code will be available soon. (Currently being debugged.)

In Figure 4, we have the same number of “on” and “off” nodes in each of the two 1D CVM grids (single zigzag chains). The upper portion shows the activations after two of the nodes in the lower figure have been swapped.

Although we are not showing the free energies associated with the two 1D CVM grids in Figure 4, they are different – even though they each contain the same number of “on” and “off” nodes!
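A toy illustration of this point (deliberately ignoring the full CVM bookkeeping, which also tracks next-nearest neighbors and triplets): the two chains below each have four “on” and four “off” nodes, but they differ in how many like-with-like nearest-neighbor pairs they contain, so their configuration variables, and hence their entropies and free energies, differ.

```python
def like_pairs(chain):
    """Count nearest-neighbor pairs (consecutive nodes) whose activations match."""
    return sum(a == b for a, b in zip(chain, chain[1:]))

before = [1, 1, 1, 1, 0, 0, 0, 0]      # clustered arrangement
after  = [1, 0, 1, 1, 0, 1, 0, 0]      # same chain after swapping nodes 1 and 5

print(sum(before), sum(after))                 # 4 4  -- same number of "on" nodes
print(like_pairs(before), like_pairs(after))   # 6 2  -- different local configurations
```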


1D vs 2D CVM Grids

Figure 4 illustrated two instances of a 1D CVM grid (constructed as a single zigzag chain, composed of two offset rows of nodes). This is the simplest possible version of a CVM grid.

We will prefer using a 2D grid for CORTECONs. This is because a 2D grid offers more versatility – and the potential for interesting clustering behaviors. However, we introduce the CORTECON architecture with 1D CVM grids, because both the underlying equations and the corresponding code are simpler.

Once we’ve sufficiently addressed how a 1D CVM can be used, we’ll progress to 2D CVM grids and their respective equations, phase spaces, grid topographies, and the associated codes.

Whether we use a 1D or 2D CVM grid, the underlying notion is the same: we can obtain useful and interesting behaviors because the entropy term in a CVM system is more complex than the entropy term in a classic, simple Ising model.

Figure 5 illustrates a 2D CVM grid, with before-and-after depictions of node activations. The “before” case is the original data representation, a manually-created approximation of a “scale-free” suite of clusters. The “after” case shows the 2D CVM grid after it has been brought to a free energy minimum, for a specific set of enthalpy parameters. (See related figures and experimental details in (Maren 2021).)

Figure 5. A 2D CVM grid, with equal numbers of “on” and “off” units in each grid. The grid in (a) is a manually-created “scale-free” set of clusters, and the grid in (b) is that grid after being brought to a free energy minimum for a specific pair of enthalpy parameters. See (Maren 2021) for related work and experimental methods.

In the following sections, we briefly review the formulation of the free energy term, and identify how the entropy is expressed for a CVM system.

Free Energy: The Classic Ising Model

We know that the free energy of a system is a function of two variables, the enthalpy and the entropy. (In neural networks and variational inference applications, we typically divide through by the temperature, which gives us a reduced equation with just these two variables.) The free energy equation is shown in the following Figure 6.

Figure 6. (The reduced) free energy is given as the enthalpy minus the entropy.
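In symbols (my own rendering of Figure 6; the bar denotes division through by the temperature factor k_B T):

```latex
\bar{F} \;=\; \bar{H} \;-\; S,
\qquad\text{with}\qquad
\bar{F} = \frac{F}{k_{B}T}, \quad
\bar{H} = \frac{H}{k_{B}T}, \quad
S = \frac{\mathcal{S}}{k_{B}} .
```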

The Enthalpy Term

In the classic Ising equation for free energy, the enthalpy is a function of two terms:

  • The activation enthalpy, which is typically a linear function of the relative fraction of nodes in the “on” or “active” state, and
  • The interaction enthalpy, which is typically a function of the square of the fraction of nodes that are “on” or “active.”

Both terms appear in the standard depiction of the Ising equation offered by the venerable Wikipedia, reproduced in Figure 7.

Figure 7. The enthalpy for a classic Ising equation includes two terms: interaction enthalpy (left-most term) and activation enthalpy (right-most term).
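For reference, the standard Wikipedia form of the Ising energy function, which should match the two terms described in the Figure 7 caption, is:

```latex
H(\sigma) \;=\; -\sum_{\langle i\,j \rangle} J_{ij}\,\sigma_i \sigma_j \;-\; \mu \sum_{j} h_j \sigma_j ,
```

where the first (interaction) term sums over nearest-neighbor pairs and the second (activation) term couples each node to an external field.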

When we keep the number of nodes in the “on” state constant, as we do when we swap out two nodes that are respectively in the “on” and “off” states, then we keep the activation enthalpy term constant. Thus, node-swapping does not impact the activation enthalpy.

In the example that we are using, we will also have the interaction enthalpy parameter set to zero, so that node swapping does not affect the interaction enthalpy term.

This means that when we interchange two nodes, one each from the “on” and “off” states, we are not affecting the enthalpy term in the free energy.

Let’s think about this relative to what we already know about energy-based neural networks. All energy-based neural networks (restricted Boltzmann machines (RBMs) and all their derivatives) operate by adapting connection weights between the visible and hidden nodes, as the network is trained with patterns that have the visible nodes in various “on” and “off” binary activations.

The entropy term for an RBM is fixed. It is determined when the training data set is assembled. This is why classic papers on energy-based neural networks make a point of discussing “Gibbs sampling” and “Markov chain Monte Carlo.” These are both methods for ensuring that the entropy term contains appropriate representations of the pertinent training data. (See, e.g., Hinton 2002, and Hinton et al. 2012.)
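For readers who want the flavor of that sampling machinery, here is a minimal sketch of one block-Gibbs step in a Bernoulli RBM. The weights and the “data” vector are random placeholders; this is illustrative, not the training code from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Placeholder RBM: 6 visible units, 4 hidden units, random weights.
W, b_vis, b_hid = rng.normal(size=(4, 6)), np.zeros(6), np.zeros(4)
v = rng.integers(0, 2, size=6)                    # a binary "data" vector

# One block-Gibbs step: sample hidden given visible, then visible given hidden.
p_h = sigmoid(W @ v + b_hid)
h = (rng.random(4) < p_h).astype(int)
p_v = sigmoid(W.T @ h + b_vis)
v_recon = (rng.random(6) < p_v).astype(int)       # "reconstructed" visible vector
```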

The Entropy Term

The entropy term is hugely important in energy-based neural networks and variational inference because the negative of the entropy provides the system with a baseline convex function – one with a clear minimum. Thus, methods that drive a system towards a “free energy minimum” have the advantage of a clear minimum provided by the negative of the entropy, as shown in the following Figure 8.

Figure 8. Negative entropy for a bistate system, with an obvious minimum when there are equal numbers of nodes (or elements) in the two possible states.

In a typical bistate system, one whose nodes can only be “on” or “off,” the entropy is very simple. It can be expressed as a function of a single variable, x, which is the fraction of “on” nodes in the system.
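Written out per node (my own rendering, consistent with the description above), with x the fraction of “on” nodes, this is the familiar binary entropy:

```latex
S(x) \;=\; -\bigl[\, x \ln x \;+\; (1 - x)\ln(1 - x) \,\bigr],
```

so the negative entropy, −S(x), is a bowl-shaped (convex) curve whose minimum falls at x = 0.5, i.e., equal numbers of “on” and “off” nodes, as Figure 8 shows.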

The various enthalpy terms adjust the location of the free energy minimum, but the negative entropy term provides the essential convex function.

In energy-based neural networks, the entropy is determined by the data collection. Once the data has been assembled, the entropy is implicit (not directly calculated); it impacts the training and free energy minimization by including all of the necessary data types, in their appropriate proportions. Thus, the entropy is not a variable in these systems – it is not something that can be adjusted by tweaking parameters.

Entropy in the Cluster Variation Method

In contrast, the entropy in a CVM system can – and does – change, depending on the positions of the active nodes and their relationship to each other.

In traditional energy-based neural networks, there is no concept or representation of a “spatial arrangement” of the latent (hidden) nodes. They do not have any interactions with each other. Even the potential for latent node interactions (or consideration of their relative spatial activation patterns) was removed when we went from simple Boltzmann machines to restricted Boltzmann machines, where the only connections are between “visible” and “hidden” (latent) nodes.

In contrast to this, the spatial configuration of latent nodes in a CVM grid is very important. The local patterns, expressed via a set of local configuration variables, define the entropy.

We can see these configuration variables in the following Figure 9.

Figure 9. The four different kinds of local configuration variables used in the cluster variation method.
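As a toy illustration of how such configuration variables can be tallied, the sketch below treats a single zigzag chain as a flat list, in which consecutive nodes (i, i+1) sit in different rows (nearest-neighbor pairs), nodes two apart (i, i+2) sit in the same row (next-nearest-neighbor pairs), and (i, i+1, i+2) forms a triplet. The labels x, y, w, and z follow the usual CVM naming, but the simplified counting here is my assumption, not the exact bookkeeping of Figure 9 or of the repository code.

```python
from collections import Counter

def config_counts(chain):
    """Tally simplified configuration-variable counts for a 1-D zigzag chain
    stored as a flat list of 0/1 activations."""
    x = Counter(chain)                               # single units
    y = Counter(zip(chain, chain[1:]))               # nearest-neighbor pairs (i, i+1)
    w = Counter(zip(chain, chain[2:]))               # next-nearest-neighbor pairs (i, i+2)
    z = Counter(zip(chain, chain[1:], chain[2:]))    # triplets (i, i+1, i+2)
    return x, y, w, z

x, y, w, z = config_counts([1, 1, 0, 1, 0, 0, 1, 0])
print(y)    # e.g., Counter({(1, 0): 3, (0, 1): 2, (1, 1): 1, (0, 0): 1})
```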

The free energy for a 1D CORTECON is shown in the following Figure 10, taken from (Maren, 2016) and based on prior work by Kikuchi (1951) and Kikuchi and Brush (1967). The first term is for the enthalpy. (This is a simple one-parameter enthalpy; the equation is written for the case where the activation enthalpy is set to zero, so we are seeing only the interaction enthalpy term.)

Figure 10. The free energy equation for a 1D CVM (Maren 2016).

The next two terms (involving summations) are for the entropy; it is computed using these two terms – one involving nearest neighbors (the y's) and another involving triplets (the z's).

The final two terms are used for the Lagrange method, and are not really needed when we start creating our computational solutions.

Only one real “control” parameter – epsilon – is shown here (multiplying a sum of z's, which gives the enthalpy term).

The 2D CVM is a bit more complex; Figure 11 shows the entropy for the 2D version.

Figure 11. The entropy term for the 2D CVM (Maren 2019, 2021, based on Kikuchi 1951 and Kikuchi and Brush 1967). There are four terms in the 2D CVM entropy component, as compared with just two entropy terms for the 1D CVM.

Thus, in CORTECONs, we introduce free energy minimization across a layer of latent nodes, where the local activation patterns not only affect the enthalpy of the system, but also the entropy. This means that when we interchange the position of two nodes (one “on,” and the other “off”), resulting in different local patterns, we change the free energy of the system.


Two Primary Control Parameters

As with a simple Ising model, there are just two parameters used in a CVM equation, whether 1D or 2D.

  • Epsilon0 – the activation enthalpy parameter – controls the fraction of nodes or units that will be in the “on” state. When this parameter is set to zero, we have an equal number of “on” and “off” nodes.
  • Epsilon1 (or a related parameter, which I call the h-value) – controls how having nearest neighbors of like nature reduces the overall free energy. The greater the value that we use (>0 for epsilon1, and >1 for the h-value), the more the overall free energy is reduced when we have more “clustering” of like-with-like.

All experiments right now, and the associated papers, blogposts, and YouTubes, are being conducted for the case where epsilon0 = 0, because there are analytic solutions for both the 1D and 2D CVM free energy equations for this epsilon0 = 0 case. This means that we can compare our computational results against the analytic ones, providing a test for both the computational and analytic solutions.

After we’ve covered the basics, we’ll start to address the (epsilon0, epsilon1) phase space. This means that we can control how many nodes are “on,” and what the overall clustering nature is like, just with these two parameters.

As we map out the phase space, we’ll find that we can evolve both the fraction of active nodes and also their clustering behavior through a trajectory. This means that we can, more-or-less smoothly, control the overall behavior of the system.

This does not give us control of the exact nodes that will be “on” or “off.” To do that, we would continue to use the vertical connections that we would create using either backpropagation or Boltzmann machine training. (For the sake of creating very transparent models, we will use backpropagation for early CORTECONs.)

The (epsilon0, epsilon1) parameter pair gives us overall control; more nuanced behaviors come from a richer set of mechanisms. Specifically, once a node is no longer receiving direct activation stimulus from the input layer, it can persist in an active state as a function of:

  • A basic decay rate established with a half-life, and
  • More complex activation decay, influenced by inputs from short-range and long-range lateral connections.

Thus, with a combination of overall system control determined by (epsilon0, epsilon1) and local activation and interactions mediated by activation decay control and inter-node activations, we can induce a range of behaviors.
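As one concrete (and purely hypothetical) way to picture this, the sketch below decays a node’s activation with a fixed half-life once its direct input is removed, and lets summed lateral input stretch that half-life, so that a strongly supported node “holds the thought” longer. The parameter names and the stretching rule are illustrative assumptions, not the CORTECON decay model itself.

```python
def decayed_activation(a0, t, half_life, lateral_input=0.0, lateral_gain=1.0):
    """Activation of a latent node t time-steps after its driving stimulus stops.

    a0            -- activation at the moment the stimulus stopped
    half_life     -- baseline half-life of the decay (in time-steps)
    lateral_input -- summed input from short- and long-range lateral connections
    lateral_gain  -- how strongly lateral input slows the decay (an assumption)
    """
    effective_half_life = half_life * (1.0 + lateral_gain * lateral_input)
    return a0 * 0.5 ** (t / effective_half_life)

# With no lateral support the activation falls to 0.25 after six steps;
# with lateral support it is still at 0.5 -- "holding that thought" longer.
print(decayed_activation(1.0, t=6, half_life=3))                      # 0.25
print(decayed_activation(1.0, t=6, half_life=3, lateral_input=1.0))   # 0.5
```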

We will describe the behaviors more fully in future blogposts and articles, together with accompanying YouTube presentations.


Parallel Computational Pathway

CORTECONs can work as a parallel processing pathway, alongside known architectures such as Multilayer Perceptrons (MLPs) or Restricted Boltzmann Machines (RBMs). Naturally, CORTECONs can be inserted at multiple levels within a deeper architecture.

In this blogpost, and in ones to come, we address the fundamental building block for a larger-scale architecture. Thus, we consider only the case where a single CORTECON works alongside a single traditional architecture, e.g., an MLP.

The notion of having parallel processing pathways is very well known within brain science. This provides an inspiration for how CORTECONs can be used in tandem with other architectures.

For a recent discovery of a new parallel processing pathway in the brain, see the work of Aryn Gittis and her group, presented in Isett et al. (2023). Carolyn Sheedy (2023) provides a layman’s summary of the significance of this work.


Code Status

The current code is in the GitHub repository “simple-1D-CVM-w-Python-turtle-1pt4pt2-…” (see GitHub link in Resources and References, below). This code allows the user to perform an interactive swap. However, it does not yet compute the free energy for the two different grid configurations.

The current code contains the full method for creating the Node object “wLeft” attribute. This is detailed in the corresponding YouTube code walkthrough, “CORTECONS: 1-D Cluster Variation Method Code Walkthrough – Part 1 (wLeft).” (See YouTube link in “Prior Related YouTubes” in the Resources and References section.)

The current code does not contain the remaining local configuration variables, e.g., wRight, and the two y and z variables, that need to be computed and associated with each Node object. This will be done in future code releases.

This is second-generation, object-oriented code. We accomplished the same free energy minimization task (our first objective) in earlier, first-generation code written in straightforward structured Python. However, we need an object-oriented approach in order to do the more interesting things … hence we are recasting our original code into an O-O framework, and are creating tutorial presentations (see the YouTube in Resources and References) as we do so.
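For readers following along before the next code release, here is a rough, hypothetical sketch of that object-oriented direction: a Node that carries its activation plus slots for its local configuration variables (only wLeft is populated in the current repository; the rest are placeholders here), and a grid-level swap helper. This is not the repository code.

```python
class Node:
    """One unit in the 1-D CVM zigzag grid (illustrative sketch only)."""
    def __init__(self, index, activation):
        self.index = index
        self.activation = activation   # 0 ("off") or 1 ("on")
        self.w_left = None             # next-nearest-neighbor configuration variable
        self.w_right = None            # placeholders for variables planned in
        self.y_vars = None             #   future code releases
        self.z_vars = None

def swap_activations(grid, i, j):
    """Interchange the activations of two nodes (one "on," one "off"),
    leaving the total number of active nodes unchanged."""
    grid[i].activation, grid[j].activation = grid[j].activation, grid[i].activation

grid = [Node(k, a) for k, a in enumerate([1, 0, 1, 1, 0, 0, 1, 0])]
swap_activations(grid, 1, 2)
```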



“Live free or die,”* my friend!

* “Live free or die. Death is not the worst of evils.” – attrib. to U.S. Revolutionary War General John Stark. https://en.wikipedia.org/wiki/Live_Free_or_Die

Alianna J. Maren, Ph.D.

Founder and Chief Scientist

Themesis, Inc.



Resources and References

Prior Related YouTubes

This Themesis YouTube introduces the object-oriented code for a 1D CVM grid, which is a preparation study for working with a 2D CVM grid. We present a code walkthrough for computing a single node attribute, wLeft.

Maren, Alianna J. 2023. “CORTECONS: 1-D Cluster Variation Method Code Walkthrough – Part 1 (wLeft).” Themesis, Inc. YouTube Channel (Aug. 30, 2023). (Accessed Sept. 5, 2023; available online at https://www.youtube.com/watch?v=MChZ83EE3EY.)

We introduced the key concepts that create a framework for CORTECONs in this YouTube.

Maren, Alianna J. 2023. “Key Concepts for a New Class of Neural Networks.” Themesis, Inc. YouTube Channel (June 30, 2023). (Accessed Sept. 5, 2023; available online at https://www.youtube.com/watch?v=xR2oKtUrh3I&t=67s.)

Prior Related Blogposts

This blogpost accompanies the code walkthrough YouTube identified in the previous section.


Themesis GitHub Repository

AJM’s Note: This is new. This is the first time that we are pointing you to a code repository – with the intention that this will hold not just code, but also slide decks (MS-PPTX™) showing how the code for computing local variables is based on the CVM (cluster variation method) grid architecture, code walkthroughs, worked examples, etc.


Readings

Do Androids Dream?

  • Dick, Philip K. 1968 (First Ed.) Do Androids Dream of Electric Sheep? New York: Doubleday. (See plot summary on Wikipedia.)

2-D Cluster Variation Method: The Earliest Works (Theory Only)

  • Kikuchi, R. 1951. “A Theory of Cooperative Phenomena.” Phys. Rev. 81: 988-1003. (Accessed 2018/09/17; pdf.)
  • Kikuchi, R., and S.G. Brush. 1967. “Improvement of the Cluster-Variation Method.” J. Chem. Phys. 47: 195. (Available online for purchase through the American Institute of Physics; $30.00 for non-members.)

1-D Cluster Variation Method: Computational Result

  • Maren, A.J. 2016. “The Cluster Variation Method: A Primer for Neuroscientists.” Brain Sci. 6(4): 44. https://doi.org/10.3390/brainsci6040044. (Accessed 2018/09/19; pdf.)

2-D Cluster Variation Method: Experiments

  • Maren, Alianna J. 2019. “2-D Cluster Variation Method Free Energy: Fundamentals and Pragmatics.” arXiv:1909.09366v1 [cs.NE] 20 Sep 2019. (pdf)
  • Maren, Alianna J. 2021. “The 2-D Cluster Variation Method: Topography Illustrations and Their Enthalpy Parameter Correlations.” Entropy 23(3): 319; https://doi.org/10.3390/e23030319 (Accessed Sept. 6, 2023; available online at pdf.)

Hinton on Energy-Based Neural Networks

  • Hinton, Geoffrey, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury. 2012. “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups.” IEEE Signal Processing Magazine 29(6): 82-97 (November 2012). (pdf)
  • Hinton, G.E. 2002. “Training Products of Experts by Minimizing Contrastive Divergence.” Neural Computation 14, 1771–1800. (pdf)

Karl Friston and Active Inference (2013, 2015), and Maren on Variational Methods (2019)

AJM’s Note: Friston’s 2013 paper is the central point for the theoretical (and mathematical) development of his notions on free energy in the brain, and in any living system. He starts with the notion of a system separated by a Markov boundary from its external environment, and moves on from there. This blogpost series is largely focused on this paper, buttressed with Friston et al. (2015).

  • Friston, Karl. 2013. “Life as We Know It.” Journal of The Royal Society Interface 10. doi:10.1098/rsif.2013.0475. (Accessed Oct. 13, 2022; pdf.)

AJM’s Note: Friston and colleagues, in their 2015 paper “Knowing One’s Place,” show how self-assembly (or self-organization) can arise out of variational free energy minimization. Very interesting read!

  • Friston, K.; Levin, M.; Sengupta, B.; Pezzulo, G. 2015. “Knowing One’s Place: A Free-Energy Approach to Pattern Regulation.” J. R. Soc. Interface 12: 20141383. doi:10.1098/rsif.2014.1383. (Accessed Oct. 3, 2022; pdf.)

AJM’s Note: If you’re going to read Friston (2013) or Friston et al. (2015), you’ll likely need a bit of an exposition to take you from Friston’s rather terse equations to a more in-depth understanding.

The best way to understand Friston’s take on variational free energy is to go back to Matthew Beal’s dissertation. (See this blogpost.)

And the best way to understand Friston and Beal together is to read my own work, in which I cross-correlate their notation. (This is best done while reading the five-part blogpost sequence on variational free energy; see the Blogposts section previously in Bunny Trails, and for that, see this blogpost.)

This arXiv paper is one that I wrote, mostly for myself, to do the notational cross-correspondences.

  • Maren, Alianna J. 2022. “Derivation of the Variational Bayes Equations.” Themesis Technical Report TR-2019-01v5 (ajm). arXiv:1906.08804v5 [cs.NE] (4 Nov 2022). (Accessed Nov. 17, 2022; pdf.)

This arXiv paper (Themesis Technical Report) is the source for the famous two-way deconstruction of the Kullback-Leibler divergence for a system with a Markov blanket, which is central to Friston’s work.

In this paper, I present a new divergence method that enables CORTECONs to model 2-D grids, where the CORTECON model is itself at a free energy minimum.

  • Maren, Alianna J. 2022. “A Variational Approach to Parameter Estimation for Characterizing 2-D Cluster Variation Method Topographies.” Technical Report THM TR2022-001 (ajm).  arXiv:2209.04087v1 [cs.NE] (9 Sep 2022). (Accessed Sept. 11, 2023; pdf.)

Sajid et al. on using the 2-D cluster variation method to model cancer niches.

  • Sajid, Noor, Laura Convertino, and Karl Friston. 2021. “Cancer Niches and Their Kikuchi Free Energy.” Entropy (Basel) (May 2021) 23(5): 609. doi:10.3390/e23050609 (Accessed Sept. 11, 2023; pdf.)

This is a very interesting article by Danijar Hafner et al., advocating “action-perception divergence.”

  • Hafner, Danijar, Pedro A. Ortega, Jimmy Ba, Thomas Parr, Karl Friston, and Nicolas Heess. 2022. “Action and Perception as Divergence Minimization.” arXiv:2009.01791v3 [cs.AI] (13 Feb 2022). (Accessed Sept. 11, 2023; pdf.)

A New Parallel Processing Pathway in the Brain

Here’s the original research article (see Sheedy 2023, cited above, for a layman’s overview):

  • Isett, Brian R., Katrina P. Nguyen, Jenna C. Schwenk, Jeff R. Yurek, Christen N. Snyder, Maxime V. Vounnatsos, Kendra A. Adegbesan, Ugne Ziausyte, and Aryn H. Gittis. 2023. “The indirect pathway of the basal ganglia promotes transient punishment but not motor suppression.” Neuron (May 16, 2023) S0896-6273(23)00302-1. doi:10.1016/j.neuron.2023.04.017. (Accessed May 17, 2023; available online at https://www.biorxiv.org/content/10.1101/2022.05.18.492478v1.full.)