The Road to AGI: The GPU No Longer Exists — Thoughts After the NVIDIA GTC Keynote
This title has been spinning in my head for a long time, but it was so broad that I couldn't find a suitable entry point. NVIDIA's recent keynote seems to have provided an opportunity—a chance to dismantle the concept piece by piece.
1. The Prerequisite for AGI
Actually, this is a question that no one has yet discussed clearly enough. But given that the Transformer architecture remains effective, more data and larger-scale models look like the higher-certainty path. On the data side there is almost no dispute: more data must come from the real physical world, and "more" could mean several orders of magnitude, so securing data is a long journey. Computing on that data more efficiently is another endless road; it is the goal of NVIDIA's Blackwell architecture and the foundation for pulling ahead of its competitors once again.
2. The GPU No Longer Exists; Only the GPU System Does
The hardware problem NVIDIA faces is how to satisfy as much computational demand as possible within a hardware system constrained by physical laws. From today onwards, therefore, the GPU as a standalone unit no longer exists; only the GPU System is the correct concept.
The capability of a single chip (the term "die" is more appropriate here) is limited by Moore's Law. Consequently, the only way forward is to demand efficiency from "density" through continuous process improvements: the more advanced the process, the more transistors per unit area, and the higher the computing power. This is the path NVIDIA followed up until the Hopper architecture. However, competitors like AMD, with the MI300, have been using chiplets to expand the actual area of the chip to enhance single-chip capability.
Improving the capability of a single die is extremely limited: calculated carefully, on a per-die basis Blackwell offers only about a 25% improvement over Hopper. NVIDIA has therefore started increasing capability at the chip level as well. A Blackwell GPU is made of two dies "joined" together, which makes the whole Blackwell GPU 2.5 times as powerful as a Hopper GPU (2.5 / 2 − 1 = 25% per die). On one hand, this shows how difficult single-die improvement has become; on the other, it represents an exploration of a new direction. (I suspect that, with the H100 still in short supply and CoWoS capacity still ramping, the Blackwell design can make better technical use of existing packaging capacity while differentiating product forms to avoid cannibalizing H100 demand.) It also partially confirms the authenticity and feasibility of NVIDIA's leaked roadmap (shifting from a two-year to a one-year product cycle); if a new architecture arrives next year, it could be on 3nm. I think they are learning from Apple's M-series playbook: with every process generation, Apple first ships the single-die parts (M1/M2/M3 up through the Max, for MacBooks, iMacs, and Mac minis) and only later "stitches" two Max dies together into an Ultra for the Mac Studio or Mac Pro.
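To make that parenthetical arithmetic explicit, here is a minimal sketch using the figures cited above (the 2.5x per-GPU ratio and the two-die configuration as described in this post, not official spec numbers):

```python
# Per-die gain implied by the figures above: one Blackwell GPU (2 dies)
# is cited as ~2.5x a Hopper GPU (1 die).
blackwell_vs_hopper_per_gpu = 2.5
dies_per_blackwell_gpu = 2

per_die_gain = blackwell_vs_hopper_per_gpu / dies_per_blackwell_gpu - 1
print(f"{per_die_gain:.0%}")  # -> 25%
```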
Thus, through "joining," one can double or more than double single-chip capability on the same process node. But physics says this "joining" must have limits; otherwise Apple would have put the rumored "quad-joined" M2 Extreme in the Mac Pro last year instead of the "double-joined" Ultra. (Rumor has it the Apple Car team did manage a "quad-joined" design before the project was disbanded.) This shows that, although the road is difficult, there is more room to maneuver at the system level than within the tight constraints of a single die.
Expanding further into physical space means interconnecting chips. Using NVLink and InfiniBand (IB) networks to "join" clusters comes at the cost of much lower transmission speeds. Before ChatGPT, demand for the most advanced networking was relatively weak: 800G optical modules were ready while penetration of 200G switches remained low. Over the past year, cluster build-outs have accelerated largely by catching up on networking (though improvements in single-chip capability and HBM memory bandwidth have also contributed significantly). This is essentially demanding computing power from "data center floor space." We call this a GPU Cluster.
With the Blackwell launch, NVIDIA introduced the GB200 NVL72 (one rack of 18 servers, each carrying two GB200 superchips, each of which pairs one Grace CPU with two Blackwell GPUs, for 72 Blackwell GPUs in total) and formally referred to this rack as a "GPU." Of course, "GPU System" is the more accurate term. For AI, the standalone GPU is dead; the GPU System is the new reality.
This is the evolution of the hardware: increasing power on the die --> increasing power on the chip --> increasing power in the system. The further out you go, the looser the physical constraints, but the slower the data moves (AI models, whether training or inference, are essentially massive high-speed data transfers).
However, technological evolution always seeks to break physical limits: process nodes increase compute density within die area limits; chiplets increase density within packaging area limits; and systems increase density within floor space limits. (Whether using SerDes copper or fiber optics within a system, the longer the cable, the more significant the signal attenuation.)
The last point is the most intuitive: how much compute (chips or dies) can fit into the same data center area? The more you pack in, the more complex the wiring. A GB200 NVL72 uses a staggering 2 miles of copper cabling.
This is currently the highest-density GPU rack available: 72 Blackwell GPUs (each with two Blackwell dies), compared to the previous standard of about 32 (a DGX H100 rack with four machines of 8 H100s each). In terms of compute, single-rack power in the GB200 era is (72 × 2.5) / 32 = 180 / 32 = 5.625 times that of the H100 era.
(Breaking it down: a 25% increase per die, a 100% increase per chip [two dies instead of one], and a 125% increase in chips per rack [72/32 − 1 = 125%]; compounded, 1.25 × 2 × 2.25 = 5.625.)
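Put together as a quick script (same assumed numbers as above; the 2.5x per-GPU ratio and the GPU counts per rack are as cited in this post, not official benchmarks):

```python
# Rack-level compute uplift, GB200 NVL72 vs a DGX H100 rack.
h100_gpus_per_rack = 32        # 4 servers x 8 H100s
gb200_gpus_per_rack = 72       # 18 servers x 2 GB200 superchips x 2 Blackwell GPUs
blackwell_vs_hopper_per_gpu = 2.5

rack_uplift = gb200_gpus_per_rack * blackwell_vs_hopper_per_gpu / h100_gpus_per_rack
print(rack_uplift)  # 5.625

# Same result as the compounded breakdown above:
print(1.25 * 2 * (72 / 32))  # per-die x per-chip x per-rack = 5.625
```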
So, if we believe a prerequisite for AGI is an increase of 2 to 5 orders of magnitude in cluster computing power, we are already seeing a nearly 6-fold increase within the largest physical unit (the rack). And there is more to look forward to: Meta's 24K-H100 cluster is the largest publicly disclosed so far; the next challenge, within a similar data center footprint, is a 100,000-GPU cluster.
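As a rough sanity check of that jump (my own back-of-the-envelope, reusing the assumed 2.5x per-GPU ratio; the 100,000-GPU figure is the hypothetical target mentioned above, not an announced system):

```python
# Hypothetical cluster-level comparison: 100K Blackwell GPUs vs Meta's 24K H100s,
# counting only raw per-GPU throughput. Illustrative numbers, not benchmarks.
h100_cluster_gpus = 24_000
blackwell_cluster_gpus = 100_000
blackwell_vs_hopper_per_gpu = 2.5

cluster_ratio = blackwell_cluster_gpus * blackwell_vs_hopper_per_gpu / h100_cluster_gpus
print(f"{cluster_ratio:.1f}x")  # ~10.4x, roughly one order of magnitude per step
```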
3. Using Large Models to Extract Data from the Physical World
If the road to AGI ultimately requires more data from the physical world, and if "everything is computation" ultimately requires more physical world data...
Then how do we get that data?
We need a carrier—something that can complete more tasks while interacting with the physical world and simultaneously collect as much data as possible. This carrier is Embodied AI. Smart cars are carriers; robots (humanoid or otherwise) are carriers.
Setting aside unrealistic sci-fi doomsday scenarios (which I believe have a probability approaching zero), Large Language Models have provided an excellent foundation: human-machine interaction through a symbolic language system that machines now basically "understand." We give commands; we receive feedback.
As machines execute tasks, they collect data that is sent back to the cloud for training. Through repeated iterations, we look for the path where a "World Model" might emerge—the path to AGI.
4. From the Internet of Everything to the Computing of Everything
The "Mobile Internet" era was the era of the "Internet of Everything." The massive amounts of data generated during this period are the most important foundation for the existence of large models. However, when it comes to the door of the "Digital Twin"—interacting in the physical world and computing in the digital world—the accumulation of the Mobile Internet era can only push the door open a crack.
In the march toward the "Computing of Everything," we need the calculator (GPU Systems), we need the computing programs (the rapidly forming Large Model Operating Systems), and we need the aforementioned Embodied AI.
Once the excitement cools, one suddenly realizes that this is the true "worldview" revealed in Jensen Huang's two-hour keynote: NVIDIA's ambition to be the sole infrastructure of an AI era in which everything is computation.