Steps Towards an AGI Architecture (“Aw, Piffle!”)

In the last blogpost and accompanying YouTube, we ventured the opinion that Google was using two “representation levels” to create its lovely Veo-2 generative AI video capabilities.

It now appears that we (that’s the “editorial” we) may have been wrong.

Ah, piffle!

(And damn, and overall disappointment at our end.)

BUT … the good that has come from this is that several of you offered excellent articles and opinions that help us all, collectively, round out our knowledge of what it takes to create an AGI that works with image/video representations. (Be they single-level, as in a straightforward generative AI – or “Gen-AI,” as we like to call it – or more robust and complex, involving at least two representation levels.)

In this blogpost, we’re going to quickly review the articles that were offered, and include one more that is highly pertinent – a discussion of a “neuro-symbolic” architecture. We’ll start with that one.


AGI (Neuro-Symbolic) Architectures

One of the common phrases indicating a multi-level representation approach for AGI is “neuro-symbolic.” If you use that as a search keyword, you’ll find a variety of publications – typically more with a “this is important and we should do this” approach than with solid information.

One very nice stand-out from the crowd is an article by Wan et al. (2024) on neuro-symbolic architectures. (See the Resources and References section for the full citation.) This is a useful study because it addresses the architectural/computational challenges in AGI creation.

Wan et al. note in their Abstract that:

“Our studies reveal that neuro-symbolic models suffer from inefficiencies on off-the-shelf hardware, due to the memory-bound nature of vector-symbolic and logical operations, complex flow control, data dependencies, sparsity variations, and limited scalability. Based on profiling insights, we suggest cross-layer optimization solutions and present a hardware acceleration case study for vector symbolic architecture to improve the performance, efficiency, and scalability of neuro-symbolic computing.”
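To make “vector-symbolic operations” a little more concrete, here is a minimal toy sketch in Python (NumPy). This is strictly an illustration of the general hyperdimensional-computing style of binding and bundling – it is NOT the particular system that Wan et al. profile – but it shows why such workloads tend to be memory-bound: nearly every step is an elementwise pass over very wide vectors, with little arithmetic reuse.

    # Toy vector-symbolic (hyperdimensional computing) sketch.
    # Illustrative only -- NOT the architecture profiled by Wan et al. (2024).
    import numpy as np

    D = 10_000                            # hypervector width (typically thousands of dimensions)
    rng = np.random.default_rng(0)

    def random_hv():
        """Random bipolar hypervector; every element is -1 or +1."""
        return rng.choice([-1, 1], size=D)

    def bind(a, b):
        """Bind (associate) two hypervectors: elementwise multiply."""
        return a * b

    def bundle(*hvs):
        """Bundle (superpose) hypervectors: elementwise majority vote."""
        return np.sign(np.sum(np.stack(hvs), axis=0))

    def similarity(a, b):
        """Normalized dot product: near 0 for unrelated vectors, larger when related."""
        return float(a @ b) / D

    # Store a tiny symbolic record: color(ball)=red, shape(ball)=round, size(ball)=small.
    COLOR, SHAPE, SIZE, RED, ROUND, SMALL = (random_hv() for _ in range(6))
    ball = bundle(bind(COLOR, RED), bind(SHAPE, ROUND), bind(SIZE, SMALL))

    # Query "what color is the ball?" -- bipolar binding is its own inverse.
    recovered = bind(ball, COLOR)
    print(similarity(recovered, RED))     # clearly above chance (roughly 0.5)
    print(similarity(recovered, ROUND))   # near zero

Every one of those operations streams through the full 10,000-element vectors while doing very little arithmetic per byte moved – which is exactly the memory-bound behavior that Wan et al. measure and then address with cross-layer and hardware optimizations.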

What we like most about the Wan et al. approach is that it is a coherent and logical effort to address the problems associated with simple Gen-AI methods – problems that will not be overcome simply by adding in more and more processing units (of the same kind).

In other words, “Stargate” will not solve the problem.

Nor will many of the (useful, but still limited) efforts to create smaller Gen-AI systems that draw from larger, “foundation” models – because the problems inherent in the large-scale systems will be perpetuated in the smaller-scale ones.

A different approach is needed.

As Wan et al. offer in their Introduction:

“Neural methods are highly effective in extracting complex features from data for vision and language tasks. On the other hand, symbolic methods enhance explainability and reduce the dependence on extensive training data by incorporating established models of the physical world, and probabilistic representations enable cognitive systems to more effectively handle uncertainty, resulting in improved robustness under unstructured conditions.”

Rather than go into more detail, we gently encourage the interested reader to consult the original Wan et al. (2024) article.

These authors do point out the computational difficulties inherent in creating multi-representation AGI systems, which is very likely the key reason that we’ve not seen AGI to date. (And we were hopelessly optimistic in thinking that Google had solved this problem with Veo-2.)


AGI Representation Levels

We’ve been discussing AGI architectures in terms of representation levels for some time. One vid, in which we went into the various necessary components, is:

Maren, Alianna J. 2024. “AGI: An Essential Architecture.” Themesis, Inc. YouTube Channel (March 29, 2024). (Accessed Jan. 28, 2025; available at Themesis YouTube Channel.)


Where Will We Find Some REAL AGI? (And WHEN?)

The question is: if we are NOT getting “REAL” AGI from Google’s Veo-2 (meaning, we don’t even have two-level representation yet), then where should we be looking for the first AGI instances? And at the same time, when is it likely that we’ll see our first AGI? What will be the “leading indicators?”

This is important, because it helps us focus our attention.

Let’s start with a really interesting article by Ian Hogarth (2024) that asks these questions from a practical, “purse-strings” perspective:

“… the global race to build artificial general intelligence was initiated by a London-based start-up, DeepMind, founded in 2010 — well before Anthropic or OpenAI existed.”

Mr. Hogarth goes on to identify the seven “trillion-dollar tech companies in the world — Alphabet, Amazon, Apple, Meta, Microsoft, Nvidia and Tesla.” He notes (with regard to these companies) that “the economic benefits, not to mention national security and geopolitical benefits, of these companies are extraordinary — they’re a wealth engine to pay for other things a country might want to do.”

And that’s the crucial point, isn’t it? Creating an AGI – even if we don’t have one yet, and even if Veo-2 is NOT what your author had (far too blithely) hoped and assumed it could be – is going to take money.

Lots and LOTS of money.

The discussion by Wan et al. (2024; see two sections above and also the References and Resources) made it clear that creating a functional AGI will require a functional, AGI-specific architecture – and that architecture will require diverse forms of computational processing. This means that whatever any team devises in terms of an AGI design and desk-top implementation (including a combination of diverse chips, etc.) will need a much more robust and complex physical architecture – new means for getting the required computations done at the desired speed.

Jumping over (for now) the issue of who exactly we will turn to for those chips and architectural packages, it’s clear that the investment needs to go BOTH into the AGI algorithmic and system design AND into the hardware/architecture platform.

Money needs to go in BOTH directions.

Again, lots and LOTS of money.

This won’t come from an academic research project.

This will ONLY come from a high-level, concerted intention to build not only the first desk-top AGI, but the first rudimentary practical implementation.

We’re looking for a new origin story. This “origin story” will require BOTH money AND firm direction, as well as a substantive high-level vision; e.g., a “landing a man on the moon”-level vision. That level of both funding and vision will only come from a combination of government and industry resources.

Figure 1. Creating the first functional AGI will be equivalent to our first moon landing.

It could, conceivably, come from a single company – an Alphabet, a Meta. Or from NVIDIA or Tesla. But on their own, singularly, that’s not too likely. The bottom-line reason is that it takes a combination of talents beyond just the best that an industry think-tank can offer. (E.g., beyond what we’ll get from DeepMind or Meta’s FAIR.)

Along these lines – the email conversations between Elon Musk and various parties at OpenAI are very interesting, even if (by now) seven years old; see “OpenAI is on a Path of Certain Failure” (techemails.com).


The Alternative to a Government/Industry AGI Partnership

There’s only one way in which AGI can emerge that will NOT involve direct government oversight and direction – and that would be if a single company has not just sufficient financial leverage, but also a strong enough and determined enough leadership to essentially FORCE AGI development.

This leadership must be in a position to strongly and firmly nudge chip/hardware resources into line with the algorithmic/systems development.

(This also means letting go of the current generative AI (“Gen-AI”) obsession, as we all know that this will not, on its own, bring us anything like AGI.)

So we take a look at some comments made by Andrej Karpathy (a researcher whom I’ve admired for years, previously with Tesla, and currently with Eureka Labs), written January 31, 2018 in an email to Musk: “… I also strongly suspect that compute horsepower will be necessary (and possibly even sufficient) to reach AGI. If historical trends are any indication, progress in AI is primarily driven by systems [bold by AJM] – compute, data, infrastructure. The core algorithms we use today have remained largely unchanged from the ’90s. Not only that, but any algorithmic advances published in a paper somewhere can be almost immediately re-implemented and incorporated.
“Conversely, algorithmic advances alone are inert without the scale to also make them scary.”

This was followed by an email from Musk to Ilya Sutskever and Greg Brockman: “Andrej is exactly right. We may wish it otherwise, but, in my and Andrej’s opinion, Tesla is the only path that could even hope to hold a candle to Google. Even then, the probability of being a counterweight to Google is small. It just isn’t zero.”


In Summary, So Far …

The article by Ian Hogarth (see Refs and Resources) identified the seven trillion-dollar tech companies in the world:

  • Alphabet,
  • Amazon,
  • Apple,
  • Meta,
  • Microsoft,
  • Nvidia and
  • Tesla.

Of these, and following Musk’s and Karpathy’s suggestion that the ONLY viable AGI options are Google (Alphabet) and Tesla (now X, or the collection of Musk’s X-companies), we can quickly scan the list and see that this is likely correct.

Amazon could (and likely will) create some form(s) of AGI – but very likely will use these in-house, to bolster their own processes. They’re not in the habit of offering AI services – even though they have significantly boosted Anthropic. But Anthropic, even with a (reinforcement-learning-based) reasoning component attached to its LLMs, is not positioned to be a deep player in AGI.

Apple has not been an AI OR a Gen-AI player so far, and is unlikely to pursue AGI – it is not a part of Apple’s focus on consumer electronics.

Meta COULD. And likely WILL. But it is more likely that Meta will use its AGI in-house, to bolster its core mission of profiling users so that it can leverage individualized marketing, which is where it makes its money. Zuckerberg’s recent focus has been on augmented reality (AR), with initial access via Orion glasses. Taking this one step further, Yann LeCun, Meta’s outspoken Chief AI Scientist, has been talking about AGI for some years – but we’re not seeing Meta put together the kind of hardware resources that would support large-scale AGI.

Microsoft has been focusing on integrating AI capabilities into its Copilot (Pesce, 2024). It has taken the direct business-revenue approach of strapping a reinforcement learning engine onto its LLM, in direct competition with Salesforce (Kalimeris, 2024). Although it IS invested in chips/hardware, this seems more like leveraging its existing capabilities than building an AGI framework.

NVIDIA. They definitely COULD do this – but it would mean building towards an AGI algorithms-suite/architecture FROM a hardware base, rather than starting with the AGI algorithms and system and then working back to the hardware that would support them. We’re not sure this is the NVIDIA direction.

Tesla/xAI. Yeah, this could happen. This could happen, very realistically. And Musk – in addition to his own (substantive) resources – has just raised an extra $6B in a Series C round for xAI.

We’ll discuss this more in future blogposts – and in the vid that will accompany THIS blogpost.

For now, we’re going to drop back to some technical specifics, and look at some articles offered by (1) a viewer of the previous YouTube and (2) a Northwestern student, each of whom suggested some very useful articles.


Back to the Veo-2 YouTube Vid: Three Perspectives on Video Generation

One viewer (@hjups) of the previous YouTube offered the following comment:

Do you have a citation for the claim that Veo-2 is using a two-form representation approach / utilizes any form of explicit “physically-realistic object model”? From my understanding, its architecture is a single-level latent diffusion transformer trained with standard regression. The generation results would then be products of dataset curriculum and model scale, where any apparent “world modeling” would be entirely emergent. However, recent works have shown that impressive as these video models are, they only have the appearance of physical consistency but are unable to generalize physics well (Kang et al., 2024; Motamed et al., 2025). At one point I thought Veo-2 was using Neural Scene Representations (Eslami et al., 2018), but some of the generated samples lack the expected scene level consistency of such an approach (object persistence failures with camera movement).

OK, first – a big shout-out and thank you to @hjups! This was an appropriate response to that YouTube, and the comments were thoughtful and backed up by citations!

And, while digging in, we’ve been (reluctantly and sadly) forced to conclude that Veo-2 might be using a standard, old-fashioned Gen-AI approach. NO explicit communication with a “physically-realistic object model” representation layer. (Once again, “Ah, piffle!” and “Damn!”)
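For readers who want to see what that “standard, old-fashioned Gen-AI approach” looks like in code, here is a minimal sketch of a single-level latent-diffusion training step – the recipe @hjups describes: one latent representation, a denoising transformer, and a plain regression loss. This is a generic sketch under common assumptions (Veo-2’s actual implementation is unpublished), and the vae and dit arguments are hypothetical stand-ins, not real Google modules.

    # Generic single-level latent-diffusion training step (PyTorch).
    # NOT Veo-2's actual (unpublished) code; `vae` and `dit` are hypothetical stand-ins.
    import torch
    import torch.nn.functional as F

    def diffusion_training_step(vae, dit, video, optimizer, num_timesteps=1000):
        """One epsilon-prediction (denoising regression) step on a batch of videos."""
        with torch.no_grad():
            z0 = vae.encode(video)                      # pixels -> the single latent representation

        t = torch.randint(0, num_timesteps, (z0.shape[0],), device=z0.device)
        alpha_bar = torch.cos(0.5 * torch.pi * t / num_timesteps) ** 2   # toy cosine noise schedule
        while alpha_bar.dim() < z0.dim():               # broadcast noise level to latent shape
            alpha_bar = alpha_bar.unsqueeze(-1)

        eps = torch.randn_like(z0)                      # the regression target is simply the noise
        z_t = alpha_bar.sqrt() * z0 + (1.0 - alpha_bar).sqrt() * eps

        eps_hat = dit(z_t, t)                           # transformer denoiser predicts that noise
        loss = F.mse_loss(eps_hat, eps)                 # "standard regression" -- nothing symbolic anywhere

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Note what is missing: there is no second representation level – no explicit, physically-realistic object model – anywhere in that loop. Any apparent “physics” has to emerge from the data curriculum and the scale of the model.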

But in the interests of tracing down what WOULD make for good, AGI-level, robust video interpretation and generation, let’s look at the three references offered by @hjups (and see the Resources and References for full citations and links):

  • Eslami et al., 2018,
  • Kang et al., 2024, and
  • Motamed et al., 2025.

Eslami et al. (2018) on “Neural Scene Representation and Rendering”

The article by Eslami et al. (2018) addresses three key elements that we’ve seen in Piloto et al.’s work (2018, 2022) and also in Gen-AI:

  • Building a “physically-realistic object model,”
  • Predicting the appearance of an object from an angle which was NOT included in the training data, and
  • Doing this without the explicit intervention of human labeling.

They summarize their approach as:

“Generative Query Network (GQN), a framework within which machines learn to represent scenes using only their own sensors. The GQN takes as input images of a scene taken from different viewpoints, constructs an internal representation, and uses this representation to predict the appearance of that scene from previously unobserved viewpoints.”

What is particularly interesting about the Eslami et al. work is that it was published JUST BEFORE the “transformer revolution.” It appeared in 2018 (with a lot of good references to the work up to that point), and while the original transformer paper (Vaswani et al., 2017) came out in 2017, it was not widely known until a few years later. (Not to mention that the Eslami et al. work appeared in Science, meaning it was well into the review cycle before the Vaswani work became known.)

One of the most interesting things about the Eslami et al. work is that they DO CONSTRUCT an explicit “physical object model” representation that is a distinct representation layer from the original training images. So … a step towards that “multiple representation levels” approach that we believe will be needed for true AGI.
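Structurally, the GQN boils down to two pieces: a representation network that folds a handful of (image, viewpoint) pairs into a single scene representation, and a generator that renders that representation from a query viewpoint it has never observed. Below is a deliberately tiny PyTorch sketch of that structure. The real GQN uses a convolutional “tower” encoder and a recurrent, DRAW-style latent-variable generator, so the little MLPs here are placeholders for the shape of the idea, not the actual networks.

    # Structural sketch of the GQN idea (Eslami et al., 2018), in PyTorch.
    # The tiny MLPs are placeholders; only the two-part structure matters here.
    import torch
    import torch.nn as nn

    class ToyGQN(nn.Module):
        def __init__(self, image_dim=64, viewpoint_dim=7, repr_dim=128):
            super().__init__()
            # Representation network: one (image, viewpoint) pair -> per-view embedding.
            self.encoder = nn.Sequential(
                nn.Linear(image_dim + viewpoint_dim, 256), nn.ReLU(),
                nn.Linear(256, repr_dim),
            )
            # Generator: (scene representation, query viewpoint) -> predicted image.
            self.renderer = nn.Sequential(
                nn.Linear(repr_dim + viewpoint_dim, 256), nn.ReLU(),
                nn.Linear(256, image_dim),
            )

        def forward(self, context_images, context_viewpoints, query_viewpoint):
            # context_images: (B, K, image_dim); context_viewpoints: (B, K, viewpoint_dim)
            per_view = self.encoder(torch.cat([context_images, context_viewpoints], dim=-1))
            scene_repr = per_view.sum(dim=1)            # order-invariant scene representation
            return self.renderer(torch.cat([scene_repr, query_viewpoint], dim=-1))

    # The scene representation is the interesting object: a distinct, learned layer that
    # stands apart from any single training image, and that must answer queries about
    # viewpoints the system never observed.
    model = ToyGQN()
    imgs = torch.randn(2, 3, 64)       # 2 scenes, 3 context views each (flattened images)
    views = torch.randn(2, 3, 7)       # camera position/orientation for each context view
    query = torch.randn(2, 7)          # a new, unobserved viewpoint
    predicted = model(imgs, views, query)   # (2, 64)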

Their work is contemporaneous with that of Piloto et al. (2018, 2022), done at Google DeepMind, and so does not reference Piloto et al. – but there are distinct parallels.

So – excellent article, laying the foundation for pre-transformer thinking about visual scene representation.

Kang et al. (2024)

{* To be continued … AJM, Tuesday, Jan. 28, 7:40 AM Hawai’i time. *}

Motamed et al. (2025)

{* To be continued … AJM, Tuesday, Jan. 28, 7:40 AM Hawai’i time. *}


Another Set of Resources (from Northwestern Student A.J.)

Northwestern student A.J. also did a “deep dive,” looking for resources to help understand how Veo-2 works. A.J. found several, and I’m sharing A.J.’s comments and contributions verbatim, with my own commentary thrown in.

1. Blogpost by Hassabis et al.

[A.J.’s contribution:] This blog post by Demis Hassabis, James Manyika, and Jeff Dean reviews Google’s achievements in AI in 2024 and mentions the release of Veo-2 and Imagen 3, state-of-the-art models. The authors mention Veo-2 making use of AI technology for controlling the physical properties of objects – Hassabis et al. Google Blogpost.

[AJM’s commentary:] This is a very good summary by Demis Hassabis (CEO, Google DeepMind), James Manyika (Google Sr. V.P., Research, Technology & Society), and Jeff Dean (Chief Scientist, Google DeepMind and Google Research), overviewing DeepMind’s AI advances during 2024. Good read; I’m recommending it to all my students!

2. Smoothly Editing Material Properties

[A.J.’s contribution:] Google Research engineers Matthews and Li describe “parametric editing of material properties” in this Google Research blogpost on smoothly editing material properties.

The authors, Mark Matthews and Yuanzhen Li, describe developing a method that gives a generative model the capability to smoothly edit the properties of physical objects. They describe their technique in the paper Alchemist: Parametric Control of Material Properties with Diffusion Models (https://arxiv.org/pdf/2312.02970), which allows for changing a physical object’s properties without altering its shape or geometry.

The article mentions that they use CGI to render images generated by modifying one property of the physical object at a time, such as roughness, shine, transparency, and others. Further, they mention modifying a Stable Diffusion model (High-Resolution Image Synthesis with Latent Diffusion Models – https://arxiv.org/abs/2112.10752) to enable fine-grained control of material parameters.
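[AJM’s aside:] To make that “fine-grained control of material parameters” a bit more concrete, here is a rough sketch of what a parametric conditioning scheme can look like: a pretrained, Stable-Diffusion-style denoiser receives one extra conditioning signal – a scalar edit strength for a single material attribute – learned from synthetic pairs in which only that attribute changes. This is NOT the released Alchemist code; the module and argument names below are hypothetical placeholders.

    # Illustrative "parametric" material-property conditioning (PyTorch).
    # NOT the Alchemist implementation; all names here are hypothetical placeholders.
    import torch
    import torch.nn as nn

    class AttributeConditionedDenoiser(nn.Module):
        def __init__(self, denoiser, cond_dim=768):
            super().__init__()
            self.denoiser = denoiser                  # pretrained text-conditioned denoiser (stand-in)
            # Map a scalar edit strength in [-1, 1] to one extra conditioning token.
            self.strength_embed = nn.Sequential(
                nn.Linear(1, cond_dim), nn.SiLU(), nn.Linear(cond_dim, cond_dim),
            )

        def forward(self, noisy_latent, source_latent, t, text_tokens, edit_strength):
            # The unedited object's latent is concatenated channel-wise, so the model can
            # preserve shape/geometry and change only the requested material property.
            x = torch.cat([noisy_latent, source_latent], dim=1)
            strength_token = self.strength_embed(edit_strength).unsqueeze(1)   # (B, 1, cond_dim)
            cond = torch.cat([text_tokens, strength_token], dim=1)
            return self.denoiser(x, t, cond)

    # Shape check only; the "denoiser" below is a dummy, not a real checkpoint.
    dummy_denoiser = lambda x, t, cond: x[:, :4]
    model = AttributeConditionedDenoiser(dummy_denoiser)
    noise_est = model(
        noisy_latent=torch.randn(2, 4, 64, 64),
        source_latent=torch.randn(2, 4, 64, 64),
        t=torch.randint(0, 1000, (2,)),
        text_tokens=torch.randn(2, 77, 768),
        edit_strength=torch.full((2, 1), 0.5),        # e.g., "make it 50% shinier"
    )
    print(noise_est.shape)                            # torch.Size([2, 4, 64, 64])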

Finally, the article mentions using NeRF reconstruction (https://www.matthewtancik.com/nerf) to synthesize views using the images generated above. The paper that describes NeRF is NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (https://arxiv.org/abs/2003.08934).
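[AJM’s aside:] And since NeRF comes up here, a quick sketch of the core idea may help: a small coordinate MLP maps 3-D positions to color and density, and a pixel is rendered by volume-integrating those values along a camera ray. This toy is for intuition only – the actual NeRF paper adds view-direction inputs, hierarchical sampling, and other refinements.

    # Minimal NeRF-flavored sketch (PyTorch): a coordinate MLP plus volume rendering
    # along one ray. A toy illustration of the idea, not the published implementation.
    import torch
    import torch.nn as nn

    def positional_encoding(x, num_freqs=6):
        """Map 3-D points to sin/cos features so the MLP can represent fine detail."""
        feats = [x]
        for k in range(num_freqs):
            feats.append(torch.sin((2.0 ** k) * torch.pi * x))
            feats.append(torch.cos((2.0 ** k) * torch.pi * x))
        return torch.cat(feats, dim=-1)

    class TinyNeRF(nn.Module):
        """MLP from encoded 3-D position to (RGB color, volume density)."""
        def __init__(self, num_freqs=6, hidden=128):
            super().__init__()
            in_dim = 3 * (1 + 2 * num_freqs)
            self.mlp = nn.Sequential(
                nn.Linear(in_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 4),                 # 3 color channels + 1 density
            )

        def forward(self, points):
            out = self.mlp(positional_encoding(points))
            rgb = torch.sigmoid(out[..., :3])
            sigma = torch.relu(out[..., 3])
            return rgb, sigma

    def render_ray(model, origin, direction, near=0.1, far=4.0, num_samples=64):
        """Classic volume rendering: accumulate color over samples along one camera ray."""
        t_vals = torch.linspace(near, far, num_samples)
        points = origin + t_vals.unsqueeze(-1) * direction        # (num_samples, 3)
        rgb, sigma = model(points)
        deltas = torch.full((num_samples,), (far - near) / num_samples)
        alpha = 1.0 - torch.exp(-sigma * deltas)                  # opacity of each sample
        trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
        weights = alpha * trans
        return (weights.unsqueeze(-1) * rgb).sum(dim=0)           # final pixel color

    pixel = render_ray(TinyNeRF(), origin=torch.zeros(3), direction=torch.tensor([0.0, 0.0, 1.0]))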

[AJM’s commentary:] {* TBD *}

{* to be completed – AJM, Feb. 3, 2025. *}



Resources and References

AGI (Neuro-Symbolic) Architectures

  • Wan, Zishen, et al. 2024. “Towards Efficient Neuro-Symbolic AI: From Workload Characterization to Hardware Architecture.” IEEE Trans. on Circuits and Systems for Artificial Intelligence (September, 2024). Available also on arXiv; arXiv:2409.13153v2 [cs.AR] (23 Sep 2024). (Accessed Oct. 7, 2024; available at arXiv.)

Video/Image/Scene Understanding and Generation

  • Eslami, S. M. Ali, et al. 2018. “Neural Scene Representation and Rendering.” Science 360:6394. (Accessed Jan. 20, 2025; available at https://www.science.org/doi/10.1126/science.aar6170. Full access requires a free Science account; the abstract alone is available without one.)
  • Kang, B. et al. 2024. “How Far Is Video Generation from World Model: A Physical Law Perspective.” arXiv preprint arXiv:2411.02385 (2024). (Accessed Jan. 20, 2025; available online at arXiv.)
  • Motamed, Saman, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. 2025. “Do Generative Video Models Learn Physical Principles from Watching Videos?” arXiv:2501.09038v1 [cs.CV] (14 Jan 2025). (Accessed Jan. 20, 2025; available online at arXiv.)
  • Piloto, Luis, Ari Weinstein, Dhruva T.B., Arun Ahuja, Mehdi Mirza, Greg Wayne, David Amos, Chia-chun Hung, Matthew Botvinick. 2018. “Probing Physics Knowledge Using Tools from Developmental Psychology.” arXiv:1804.01128v1 [cs.AI] (3 Apr 2018). (Accessed Jan. 7, 2025; available at arXiv.)
  • Piloto, Luis, Ari Weinstein, Peter Battaglia, and Matthew Botvinick. 2022. “Intuitive Physics Learning in a Deep-Learning Model Inspired by Developmental Psychology.” Google DeepMind Blogpost Series (11 Jul 2022). (Accessed Jan. 7, 2025; available online at Google DeepMind Blogpost Series.)

Articles Suggested by A.J.

  • Hassabis, Demis, James Manyika, and Jeff Dean. 2025. “2024: A Year of Extraordinary Progress and Advancement in AI.” Google Blogpost (Jan. 23, 2025). (Accessed Feb. 4, 2025; available online at Hassabis et al. Blogpost.)

Where AGI Will REALLY Come From

  • Hogarth, Ian. 2024. “Can Europe Build Its First Trillion-Dollar Start-Up?” Financial Times (Nov. 29, 2024). (Accessed Feb. 3, 2025; available online at Financial Times.)
  • Internal Tech Emails. 2025. “OpenAI Is on a Path of Certain Failure.” (Jan. 12, 2025). (Accessed Feb. 3, 2025; available online at techemails.com.)
  • Pesce, Mark. 2024. “Copilot’s Crudeness Has Left Microsoft Chasing Google, Again.” The Register (Oct. 9, 2024). (Accessed Feb. 3, 2025; available online at The Register.)
  • Kalimeris, Dion. 2024. “Salesforce Agentforce vs Microsoft Copilot: Collaboration or Competition?” LinkedIn (Sept. 23, 2024). (Accessed Feb. 3, 2025; available online at LinkedIn.)
  • xAI Blogpost. 2024. “xAI Raises $6B Series C.” xAI Blogpost Series (Dec. 23, 2024). (Accessed Feb. 3, 2025; available online at xAI Blogpost.)
