Transformer Architecture, Superpowers, And The March Toward AGI
Scratch the surface of the best new AI development and research, and you see signs that we may be close to artificial general intelligence: the next generation, or step, in which AI systems match or exceed human-level powers of thought.
I watched an entire video in which Sam Altman talks about novel architectures, about moving beyond the transformer, the attention-based architecture underpinning today's large models, and about what that shift would mean for the industry.
It’s wild stuff, and it’s a big deal.
Using methods like compressed context to build new architectures means AI systems will better understand the world around them, which in turn will translate into much more capability across all kinds of tasks. Altman used the example of Apple’s LITO model, which can reconstruct a three-dimensional object from a single two-dimensional image.
But there’s much more to the story.
The Transformer and Its Legacy
What about the transformer? Is it so central that it’s hard to replace?
Basically, as Altman and others explain, the transformer replaced earlier recurrent architectures with self-attention, which lets a model weigh every part of its input against every other part, focusing its attention for efficiency and more targeted convergence on a given problem. But now, approaches like subquadratic scaling and liquid networks are making the transformer look replaceable, at least in some ways.
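To make the "subquadratic" point concrete (this sketch is mine, not from the video or the panel): standard self-attention builds an n × n score matrix so that every token can attend to every other token, and that quadratic cost in sequence length is exactly what subquadratic architectures try to avoid. A minimal NumPy version:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence of n token vectors.

    The (n, n) score matrix is what makes standard attention quadratic
    in sequence length -- the cost subquadratic alternatives sidestep.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # project tokens into query/key/value spaces
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (n, n): every token scored against every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                            # each output is a weighted mix of values

rng = np.random.default_rng(0)
n, d = 8, 4                                       # 8 tokens, 4-dim embeddings
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # the intermediate score matrix was (8, 8)
```

Doubling the sequence length quadruples that score matrix, which is why long-context transformers get expensive so quickly.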
In a segment at our Imagination in Action event April 9–10 (April’s IIA event is an annual conference that I help to facilitate here at MIT), my colleague from Link Ventures Dave Blundin interviewed Peter Danenberg of Google, and Alexander Amini of Liquid AI (further disclaimer, I have also been involved in Liquid AI).
Amini mentioned how big players like Alibaba, with its Qwen models, are moving away from, or beyond, the transformer.
“We're seeing this shift already today, that you know, the top models, even trillion parameter models now, are hybrid models between the transformers and other models as well,” he said.
However, Amini’s evaluation was nuanced, and he suggested that we may still use transformers for some specific kinds of projects.
“Every architecture is good for its own situation, right?” Amini said. “Transformers are good for some things, but for some hardware, for some use cases, for some design decisions, there are much better architectures, mathematical operators that exist.”
“It may be the case that transformers are slightly saturated,” added Danenberg. “I don't know if we're going to get the next step function from transformers.”
The panel also discussed how Google’s tensor processing units (TPUs) are competing with the Nvidia designs that drove the latter company to the top of the heap on the American stock market.
“One of the things that’s been evolving very quickly is somebody will have a brilliant algorithmic breakthrough, for example, and then within the Google empire, immediately, there's a conversation on how to convert it to silicon,” Blundin said.
That, Amini suggested, makes sense in the context of a market where hardware and digital operations should be, in his words, “married” together.
“This is a philosophy that we have at Liquid, that architecture and hardware should be married together, right?” Amini said. “And that if you want to build the best-quality AI, quality is also dependent on speed efficiency, energy efficiency as well. So you need these things to be co-optimized together.”
Different Kinds of Customers
Amini suggested there’s a big difference between general-purpose use cases, and those specific to an industry, where a client wants something very targeted.
“It comes down to value delivery at the end of the day, and delivering value for different types of use cases,” he said. “If you want to deploy, let's say, an AI to be a conversational assistant to you in the car, you probably don't need that AI to be optimized on PhD-level physics, right? There's a different type of distribution that will be important for those domains, and those enterprises, that is fundamentally different in scale than the big AI providers, and that will yield itself in the forms of new, efficient architectures that effectively get distilled down from the big ones.”
Danenberg painted a picture of competitive business ecosystems.
“Startups and a bunch of businesses run constellations of small models, almost like the Unix philosophy, each one of which sort of does one thing, and does it well,” he said. “And really, the job of the business, in that sense, is just to sort of orchestrate these tiny models.”
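A hypothetical sketch of that "Unix philosophy" orchestration pattern (the specialist "models" here are stand-in functions I invented, not real model calls): a thin router dispatches each request to the one small model built for it.

```python
# Stand-ins for tiny task-specific models; a real system would call
# actual small models here instead of these toy functions.

def summarize(text: str) -> str:
    """Stand-in for a small summarization model."""
    return text[:40] + "..."

def classify_sentiment(text: str) -> str:
    """Stand-in for a tiny sentiment classifier."""
    return "positive" if "great" in text.lower() else "neutral"

SPECIALISTS = {
    "summarize": summarize,
    "sentiment": classify_sentiment,
}

def orchestrate(task: str, text: str) -> str:
    """Route each request to the one small model that does that job well."""
    model = SPECIALISTS.get(task)
    if model is None:
        raise ValueError(f"no specialist for task: {task}")
    return model(text)

print(orchestrate("sentiment", "This panel was great"))
```

In Danenberg's framing, the business's real job is the `orchestrate` layer; each specialist stays small, cheap, and replaceable.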
Understanding the “Beast”
Later in the discussion, Danenberg explained how, at the outset of debate about AGI, he struggled to figure out what people would practically use the technology for. Then, he said, he started to imagine how this sort of thing would emerge.
“When this really slow, sort of expensive, big model with a lot of world knowledge baked in, when that thing is following you around all day, it turns out that it comes up with these really interesting connections that you may not have anticipated,” he said. “It may be that there are certain cases where you actually do want these, these ‘beasts.’”
Blundin added his two cents.
“These things, they cross the uncanny valley very, very quickly,” he said, “and they become something you just like, can't live without. And you know, when they're just below that level, they're icky as all hell, but when they cross the line, ‘whoa,’ instantaneously.”
There was more in the discussion, covering Moore’s law, specialists vs. generalists, and the appeal of general-purpose Nvidia hardware, among other topics.
“I feel like the world in AI will move towards hierarchical memory more and more,” Amini offered. “And right now, AI models don't have this, right? We have moved entirely away from hierarchical memory, because all of the memory is stored in the context, and you basically invest into storing all of that as tokens.”
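To illustrate the contrast Amini draws (a toy of my own, not Liquid's design): a hierarchical memory keeps a small, fast working context and evicts older items to a cheaper long-term store that is searched only on demand, instead of holding everything as in-context tokens.

```python
from collections import deque

class HierarchicalMemory:
    """Toy two-tier memory: a small working context plus a larger
    long-term store. A real system would retrieve with learned
    embeddings; this sketch uses simple word overlap."""

    def __init__(self, context_size=3):
        self.context = deque(maxlen=context_size)  # fast, tiny working memory
        self.long_term = []                        # cheap, large archive

    def observe(self, fact: str):
        if len(self.context) == self.context.maxlen:
            self.long_term.append(self.context[0])  # evict oldest to long-term
        self.context.append(fact)

    def recall(self, query: str):
        """Check recent context first, then fall back to the archive."""
        hits = [f for f in self.context if query.lower() in f.lower()]
        if hits:
            return hits[-1]
        scored = [(len(set(query.lower().split()) & set(f.lower().split())), f)
                  for f in self.long_term]
        scored.sort()
        return scored[-1][1] if scored and scored[-1][0] > 0 else None

mem = HierarchicalMemory(context_size=2)
for fact in ["meeting at 9am", "lunch with Dave",
             "Dave prefers TPUs", "demo at 4pm"]:
    mem.observe(fact)
print(mem.recall("Dave"))  # Dave prefers TPUs
```

The point of the hierarchy is that only the tiny context rides along with every model call; the archive is consulted selectively, rather than paying token costs for the whole history on every step.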
Blundin, toward the end, asked about what jobs could be “10x-ed” or made ten times more effective with AI.
“I'm actually at a loss sometimes to think of things that aren't 10x-able in that sense,” Danenberg said. “So I don't know, maybe we're shooting for 95% to 99%.”
“I think it's the majority today,” Amini added.
If we really can 10x 99% of human jobs, or replace that much productivity, what happens to human workers? That’s a big question making the rounds right now, in 2026, as research on ever more powerful models continues. Stay tuned.