The Uncertainty Behind Massive Compute Investment: The Deep Coupling of Model Architecture and Hardware Architecture (Based on a Gemini-2.5 Analysis)


Computing power demand is a complex issue. For AI training and the pursuit of AGI or ASI, one can never have too much. Within my own framework, however, I have recently been focusing on a more granular usage-cost model.

Specifically, I aim to answer: under different model architectures, as hardware improves, where exactly does performance increase, and where are the critical bottlenecks? This bears directly on commercial deployment and on the direction of future evolution.

After simplifying as much as possible, I went back to MLPerf inference results from v3.1 through v5.1. The hardware includes the A100, H100, H200, B200, and GB200 NVL-72, as well as the L40S and RTX PRO 6000. The models covered are GPT-J, Llama-2-70B, Llama-3.1-405B, DeepSeek-R1, and Stable Diffusion XL.

I reviewed this data manually, but for a more complete analysis and for charting I used two setups: GPT-5-Pro with GPT-Agent, and Gemini-2.5-Pro in AI-Studio.

GPT-5-Pro actually performed quite well: it analyzed the data, drew charts, and even built a visualization website for me. But after seeing the output from Gemini-2.5-Pro, I chose Gemini without hesitation. Two reasons: first, Gemini's understanding of GPU counts was clearly more accurate, correctly identifying that a GB200 NVL-72 rack contains 72 GPUs (18 compute trays × 4 GPUs each); second, its causal analysis was thorough and comprehensive.

I'll start with my own conclusions:

  • Currently, the coupling between hardware architecture and model architecture is too tight. Compare GB200 NVL-72 with DGX B200: for the MoE-architecture DeepSeek-R1, the performance gap is nearly tenfold, but for the dense Llama-405B it narrows to about 4-5x. This means that if models were to swing back to larger dense structures, the NVL-72 would become very uneconomical. Of course, you could argue this cluster is best for training, but then I must continue to "accuse" NVIDIA of "false advertising," since their latest claim is that investing in NVL-72 can yield a 15x return through inference.
  • If we believe the next breakout point is multimodality, then SD-XL's results, even though it is only a first-generation text-to-image model, stand in for image- and video-generation workloads: B200 offers less than a 2x performance boost over H200. Don't forget that B200 packages two dies; in theory, ignoring HBM bandwidth, it should deliver roughly 2.5x the performance of a single Hopper die. The gains Blackwell brings to multimodality are limited.
  • HBM bandwidth's impact on inference performance is direct: for models that fit on a single card, H200 shows over a 30% improvement over H100 (4.8 TB/s vs. 3.35 TB/s memory bandwidth). See the back-of-envelope sketch after this list.
  • At this point in time, the probability that model architectures change significantly over the next 2-3 years is very high, which means hardware architectures will change with them. This uncertainty is a massive challenge for model companies and CSP operators.
  • Forgive me for not being more blunt.
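
To make the arithmetic behind these bullets explicit, here is a back-of-envelope sketch in Python. Every number is taken from this post or from the report tables below; treat it as a sanity check, not a performance model.

```python
# (1) HBM bandwidth ceiling: H200 vs. H100-SXM, in TB/s. Memory-bound decoding
# scales at most with bandwidth, so ~1.43x is the ceiling for the H200 uplift;
# the observed >30% gain on single-card models sits just under it.
h100_bw_tbps, h200_bw_tbps = 3.35, 4.8
print(f"H200/H100 bandwidth ratio: {h200_bw_tbps / h100_bw_tbps:.2f}x")

# (2) Blackwell on SD-XL: per-accelerator offline results from the report.
# Two dies suggest ~2-2.5x a single Hopper die in theory, yet the observed
# uplift over H200 stays under 2x.
sdxl_b200_per_gpu, sdxl_h200_per_gpu = 3.95, 2.30
print(f"SD-XL B200/H200 per GPU: {sdxl_b200_per_gpu / sdxl_h200_per_gpu:.2f}x")

# (3) MoE vs. dense gap, GB200 NVL-72 vs. an 8x B200 system (offline totals):
r1_gap = 289712.00 / 31486.55      # DeepSeek-R1 (MoE): ~9.2x, "nearly tenfold"
dense_gap = 6233.13 / 1613.16      # Llama-3.1-405B (dense): ~3.9x, "about 4-5x"
print(f"R1 gap: {r1_gap:.1f}x, Llama-405B gap: {dense_gap:.1f}x")
```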

Before presenting the Gemini report, here are a few screenshots from GPT-5-Pro.

GPT-5-Pro Analysis 2

Next is the complete report provided by Gemini. A few notes first. In its response, the report text and images were separated by placeholders, so I pasted the images in manually. Someone will surely ask how to set this up: in AI-Studio, enable "Code Execution" and "Grounding" as shown in the settings screenshot below. Recently, people have asked about Gemini's "downgraded intelligence" (model degradation); my answer is that I haven't noticed it, and it likely depends on the prompts and usage patterns. Finally, the Matplotlib charts Gemini generated contain some garbled text due to Chinese font support: anyone familiar with Python knows the fix is to switch to a font with CJK coverage, but Gemini's code runs in a sandbox whose system fonts cannot be changed.
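
For reference, a minimal sketch of the usual workaround when you do control the environment (the font names below are examples; use whichever CJK-capable font your system actually provides):

```python
import matplotlib.pyplot as plt
from matplotlib import font_manager

# Pick the first CJK-capable font that is actually installed.
available = {f.name for f in font_manager.fontManager.ttflist}
for candidate in ["Noto Sans CJK SC", "Microsoft YaHei", "SimHei", "PingFang SC"]:
    if candidate in available:
        plt.rcParams["font.sans-serif"] = [candidate]
        break

plt.rcParams["axes.unicode_minus"] = False  # avoid broken minus signs with CJK fonts

plt.plot([3350, 4800, 8000])
plt.title("显存带宽对比")  # renders correctly once a CJK font is registered
plt.ylabel("HBM bandwidth (GB/s)")
plt.show()
```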

Also, there are a few small bugs in the report text; let's see whether any sharp-eyed readers can spot them.

AI-Studio Settings

AI Accelerator Performance Benchmark Comprehensive Analysis Report (Advanced Detailed Version)

1. Introduction

This report is based on public benchmark results from MLCommons® Inference v3.1, v4.0, v4.1, v5.0, and v5.1. It provides a comprehensive and in-depth performance evaluation of current mainstream AI accelerators (including NVIDIA GPUs and Google TPUs). The report aims to provide a detailed, quantitative performance reference for AI infrastructure decision-makers, researchers, and engineers.

Analysis dimensions include:

  • Accelerator Dimension: Evaluates multiple NVIDIA GPUs from the Ampere architecture to the latest Blackwell architecture, as well as Google's TPU v5e.
  • Model Dimension: Covers five AI models ranging from widely used industry standards to cutting-edge trends, including GPTJ-99, Llama2-70b-99.9, Stable-Diffusion-XL, DeepSeek-R1, and Llama3.1-405b.
  • Scenario Dimension: Distinguishes between "Offline" scenarios measuring maximum throughput and "Server" scenarios simulating real-world service.

The core of the report provides comparisons of raw performance data and standardized "performance per accelerator," combined with core technical specifications of the hardware.

2. Data and Methodology

2.1. Data Sourcing and Integration

This report integrates MLCommons Inference test result files from five different versions. All data were merged into a unified dataset and underwent rigorous cleaning processes, including unifying column names, removing redundant data, and handling missing values.
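
As an illustration of what this merge and cleanup typically looks like, here is a hedged pandas sketch. The file names and column mapping are hypothetical placeholders, not the report's actual code:

```python
import pandas as pd

# Hypothetical per-round result exports; real MLCommons files differ by round.
files = {
    "v3.1": "inference_v3.1.csv",
    "v4.0": "inference_v4.0.csv",
    "v4.1": "inference_v4.1.csv",
    "v5.0": "inference_v5.0.csv",
    "v5.1": "inference_v5.1.csv",
}

frames = []
for version, path in files.items():
    df = pd.read_csv(path)
    # Unify column names that drift between rounds (mapping is illustrative).
    df = df.rename(columns={"Avg. Result": "Result",
                            "# of Accelerators": "accelerators_per_node"})
    df["mlperf_version"] = version
    frames.append(df)

merged = pd.concat(frames, ignore_index=True)
merged = merged.drop_duplicates()                         # remove redundant rows
merged = merged.dropna(subset=["Accelerator", "Result"])  # handle missing values
```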

2.2. Core Calculation Metrics

To ensure fairness and transparency in comparison, we defined the following core metrics:

  • Total Accelerators: The total number of chips used in the testing platform (Number of nodes × Accelerators per node).
  • Result per Accelerator: This is the most important standardized metric in this report, calculated as Benchmark Average Result / Total Accelerators. It removes the influence of cluster size to measure the true inference capability of a single chip (see the sketch after this list).
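
In code, the two metrics reduce to two derived columns. This continues the hypothetical merged dataset from the sketch above; the column names are assumptions:

```python
# Total Accelerators = nodes x accelerators per node.
merged["total_accelerators"] = merged["nodes"] * merged["accelerators_per_node"]

# Result per Accelerator strips out cluster size.
merged["result_per_accelerator"] = merged["Result"] / merged["total_accelerators"]

# Averaging across submissions of the same chip and scenario yields the
# "Avg Result per Accelerator" figures shown in the tables below.
per_chip = (
    merged.groupby(["Accelerator", "Scenario"])["result_per_accelerator"]
          .mean()
          .reset_index()
)
```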

3. Core Technical Specifications of Accelerators

Understanding the theoretical limits of hardware is the foundation for analyzing performance. The table below summarizes the core technical specifications of the accelerators involved in the report:

| Accelerator | Architecture | FP16/BF16 Compute (TFLOPS) | Memory Type | Memory Bandwidth (GB/s) | TDP (W, Approx.) |
|---|---|---|---|---|---|
| NVIDIA A100-PCIe-80GB | Ampere | 624 | HBM2e | 1935 | 300 |
| NVIDIA A100-SXM-80GB | Ampere | 624 | HBM2e | 2039 | 400 |
| NVIDIA L40S | Ada Lovelace | 733 | GDDR6 | 864 | 350 |
| NVIDIA H100-PCIe-80GB | Hopper | 1513 | HBM3 | 2000 | 350 |
| NVIDIA H100-SXM-80GB | Hopper | 1979 | HBM3 | 3350 | 700 |
| NVIDIA H100 NVL | Hopper | 3342 | HBM3 | 7800 | 800 |
| NVIDIA H200-SXM-141GB | Hopper | 1979 | HBM3e | 4800 | 700 |
| Virtualized H200-SXM-141GB | Hopper | 1979 | HBM3e | 4800 | 700 |
| NVIDIA H200-NVL-141GB | Hopper | 3958 | HBM3e | 9600 | 1200 |
| NVIDIA GH200 Superchip | Hopper | 1979 | LPDDR5X | 4800 | 1000 |
| NVIDIA RTX 6000 Blackwell | Blackwell | 250 | GDDR7 | 1792 | 600 |
| NVIDIA B200-SXM-180GB | Blackwell | 2500 | HBM3e | 8000 | 1000 |
| NVIDIA GB200 Superchip | Blackwell | 2500 | HBM3e | 8000 | 1000 |
| Google TPU v5e | TPU v5 | 197 | - | 819 | Unknown |

4. Detailed Performance Analysis by Model

4.1. GPTJ-99 Model

GPTJ-99 is a medium-scale (6 billion parameter) language model used to evaluate the baseline performance of accelerators for standard NLP tasks.

| Accelerator | Scenario | Avg Total Result | Avg Result per Accelerator | Avg Total Accelerators |
|---|---|---|---|---|
| NVIDIA A100-PCIe-80GB | Offline | 14.70 | 3.68 | 4.0 |
| NVIDIA A100-PCIe-80GB | Server | 13.81 | 3.45 | 4.0 |
| NVIDIA A100-SXM-80GB | Offline | 27.13 | 3.39 | 8.0 |
| NVIDIA A100-SXM-80GB | Server | 16.92 | 2.12 | 8.0 |
| NVIDIA GH200 Superchip | Offline | 26.00 | 26.00 | 1.0 |
| NVIDIA GH200 Superchip | Server | 24.62 | 24.62 | 1.0 |
| NVIDIA H100 NVL | Offline | 43.56 | 21.78 | 2.0 |
| NVIDIA H100 NVL | Server | 42.07 | 21.04 | 2.0 |
| NVIDIA H100-PCIe-80GB | Offline | 4352.75 | 1089.08 | 4.7 |
| NVIDIA H100-PCIe-80GB | Server | 3395.97 | 932.53 | 4.7 |
| NVIDIA H100-SXM-80GB | Offline | 7470.24 | 1001.95 | 6.8 |
| NVIDIA H100-SXM-80GB | Server | 7353.27 | 985.96 | 6.8 |
| NVIDIA H200-NVL-141GB | Offline | 17905.15 | 2238.14 | 8.0 |
| NVIDIA H200-NVL-141GB | Server | 18141.45 | 2267.68 | 8.0 |
| NVIDIA H200-SXM-141GB | Offline | 18090.81 | 2412.90 | 7.3 |
| NVIDIA H200-SXM-141GB | Server | 18045.91 | 2399.25 | 7.3 |
| NVIDIA L40S | Offline | 2275.52 | 525.25 | 4.9 |
| NVIDIA L40S | Server | 2190.09 | 508.65 | 4.9 |
| Google TPU v5e | Offline | 9.98 | 2.50 | 4.0 |
| Google TPU v5e | Server | 7.19 | 1.80 | 4.0 |

Note: Units vary between test versions (Samples/s vs Tokens/s); values are compared directly here.

Offline Scenario: GPTJ Offline Chart

Server Scenario: GPTJ Server Chart

Analysis Insights:

  • H200 Series Leadership: In both scenarios, the H200 series leads single-card performance, thanks to its 4.8 TB/s of HBM3e bandwidth.
  • SXM Advantage: H100-SXM performs significantly better than H100-PCIe, reflecting the benefits of higher power and NVLink.
  • TPU v5e Performance: Performance is close to the A100 series but trails significantly behind Hopper and newer architectures.

4.2. Llama2-70b-99.9 Model

Llama2-70b is a core benchmark for evaluating the ability of modern accelerators to handle mainstream large language models.

| Accelerator | Scenario | Avg Total (Tokens/s) | Result per Accelerator (Tokens/s) | Avg Accelerators |
|---|---|---|---|---|
| NVIDIA B200-SXM-180GB | Offline | 88741.00 | 12303.93 | 7.2 |
| NVIDIA B200-SXM-180GB | Server | 88345.62 | 12218.53 | 7.2 |
| NVIDIA GB200 | Offline | 50710.20 | 12677.55 | 4.0 |
| NVIDIA GB200 | Server | 49287.75 | 12321.94 | 4.0 |
| NVIDIA GH200 Superchip | Offline | 3871.47 | 3871.47 | 1.0 |
| NVIDIA GH200 Superchip | Server | 3616.88 | 3616.88 | 1.0 |
| NVIDIA H100 NVL | Offline | 9493.01 | 1917.94 | 5.0 |
| NVIDIA H100 NVL | Server | 8866.93 | 1756.92 | 5.0 |
| NVIDIA H100-PCIe-80GB | Offline | 6759.53 | 1399.84 | 5.0 |
| NVIDIA H100-PCIe-80GB | Server | 5697.54 | 1171.44 | 5.0 |
| NVIDIA H100-SXM-80GB | Offline | 27825.32 | 3748.26 | 9.0 |
| NVIDIA H100-SXM-80GB | Server | 26364.50 | 3544.65 | 9.0 |
| NVIDIA H200-NVL-141GB | Offline | 27689.65 | 3777.19 | 7.3 |
| NVIDIA H200-NVL-141GB | Server | 25210.00 | 3437.31 | 7.3 |
| NVIDIA H200-SXM-141GB | Offline | 34497.33 | 4379.04 | 8.1 |
| NVIDIA H200-SXM-141GB | Server | 32453.00 | 4113.87 | 8.1 |
| NVIDIA L40S | Offline | 3143.23 | 446.58 | 7.0 |
| NVIDIA L40S | Server | 2767.63 | 391.87 | 7.0 |
| NVIDIA RTX PRO 6000 | Offline | 26205.30 | 3275.66 | 8.0 |
| NVIDIA RTX PRO 6000 | Server | 26001.04 | 3250.13 | 8.0 |
| Virtualized NVIDIA H200-SXM | Offline | 34485.70 | 4310.71 | 8.0 |
| Virtualized NVIDIA H200-SXM | Server | 33370.58 | 4171.32 | 8.0 |

Offline Scenario: Llama2 Offline Chart

Server Scenario: Llama2 Server Chart

Analysis Insights:

  • Blackwell Dominance: B200 and GB200 deliver roughly 3x the per-accelerator performance of the H200 series, reflecting deep Transformer optimization and 8000 GB/s of HBM3e bandwidth.
  • Virtualization Efficiency: Virtualized H200 performance is nearly identical to physical hardware.
  • L40S Positioning: Significantly lower performance on LLM inference compared to dedicated data center cards.

4.3. Stable-Diffusion-XL Model

SD-XL is an advanced text-to-image model demanding high computation and memory access.

| Accelerator | Scenario | Avg Total (Samples/s) | Result per Accelerator | Avg Accelerators |
|---|---|---|---|---|
| NVIDIA B200-SXM-180GB | Offline | 29.24 | 3.95 | 7.4 |
| NVIDIA B200-SXM-180GB | Server | 26.44 | 3.56 | 7.4 |
| NVIDIA GH200 Superchip | Offline | 1.78 | 1.78 | 1.0 |
| NVIDIA GH200 Superchip | Server | 1.68 | 1.68 | 1.0 |
| NVIDIA H100-PCIe-80GB | Offline | 5.94 | 1.20 | 5.0 |
| NVIDIA H100-PCIe-80GB | Server | 5.04 | 1.02 | 5.0 |
| NVIDIA H100-SXM-80GB | Offline | 13.12 | 1.89 | 6.9 |
| NVIDIA H100-SXM-80GB | Server | 12.67 | 1.82 | 6.9 |
| NVIDIA H200-NVL-141GB | Offline | 14.93 | 2.07 | 7.2 |
| NVIDIA H200-NVL-141GB | Server | 14.05 | 1.95 | 7.2 |
| NVIDIA H200-SXM-141GB | Offline | 17.86 | 2.30 | 7.8 |
| NVIDIA H200-SXM-141GB | Server | 16.54 | 2.12 | 7.8 |
| NVIDIA L40S | Offline | 3.80 | 0.67 | 5.7 |
| NVIDIA L40S | Server | 3.61 | 0.62 | 5.7 |
| NVIDIA RTX PRO 6000 | Offline | 11.08 | 1.38 | 8.0 |
| NVIDIA RTX PRO 6000 | Server | 10.88 | 1.36 | 8.0 |
| Google TPU v5e | Offline | 1.75 | 0.44 | 4.0 |
| Google TPU v5e | Server | 1.55 | 0.39 | 4.0 |
| Virtualized NVIDIA H200-SXM | Offline | 18.64 | 2.33 | 8.0 |
| Virtualized NVIDIA H200-SXM | Server | 17.95 | 2.24 | 8.0 |

Offline Scenario: SDXL Offline Chart

Server Scenario: SDXL Server Chart

Analysis Insights:

  • B200 Doubling Performance: Per-accelerator performance is roughly 2x that of H100-SXM.
  • Low Virtualization Overhead: Virtualized H200 performed excellently, even slightly higher than physical H200 in offline tests.
  • GH200 Performance: Comparable to H100-SXM, demonstrating the strength of the Hopper core.

4.4. DeepSeek-R1 Model

DeepSeek-R1 is an emerging large model; the results here come from large-scale cluster configurations and demonstrate scaling behavior.

| Accelerator | Scenario | Avg Total Result (Samples/s) | Result per Accelerator | Avg Accelerators |
|---|---|---|---|---|
| NVIDIA B200-SXM-180GB | Offline | 31486.55 | 3935.82 | 8.0 |
| NVIDIA B200-SXM-180GB | Server | 15415.43 | 1926.93 | 8.0 |
| NVIDIA GB200 | Offline | 289712.00 | 72428.00 | 72.0 |
| NVIDIA GB200 | Server | 167578.00 | 41894.50 | 72.0 |

Offline Scenario: Deepseek Offline Chart

Server Scenario: Deepseek Server Chart

Analysis Insights:

  • Scale Effects: GB200's performance per accelerator is far higher than B200, revealing the massive non-linear performance gains from the NVL72 cluster and high-speed interconnects.
  • Server Scenario Challenges: Performance drops significantly in real-time server scenarios, reflecting the difficulty of maintaining utilization under low-latency constraints.

4.5. Llama3.1-405b Model

With its 405B parameters, Llama3.1-405b is the ultimate test of AI hardware compute and memory capacity.

| Accelerator | Scenario | Avg Total Result (Samples/s) | Result per Accelerator | Avg Accelerators |
|---|---|---|---|---|
| NVIDIA B200-SXM-180GB | Offline | 1613.16 | 201.65 | 8.0 |
| NVIDIA B200-SXM-180GB | Server | 1179.05 | 147.38 | 8.0 |
| NVIDIA GB200 | Offline | 6233.13 | 1558.28 | 31.2 |
| NVIDIA GB200 | Server | 4433.33 | 1108.33 | 31.2 |
| NVIDIA H100-SXM-80GB | Offline | 794.56 | 99.32 | 16.0 |
| NVIDIA H100-SXM-80GB | Server | 557.35 | 69.67 | 16.0 |
| NVIDIA H200-SXM-141GB | Offline | 610.17 | 76.27 | 8.8 |
| NVIDIA H200-SXM-141GB | Server | 319.86 | 39.98 | 8.8 |
| Virtualized NVIDIA H200-SXM | Offline | 547.26 | 68.41 | 8.0 |
| Virtualized NVIDIA H200-SXM | Server | 277.33 | 34.67 | 8.0 |

Offline Scenario: Llama3.1 Offline Chart

Server Scenario: Llama3.1 Server Chart

Analysis Insights:

  • GB200 Built for Giants: GB200 shows its power in handling ultra-large models through massive cluster scaling.
  • Memory Bottlenecks: Interestingly, H200 shows lower per-accelerator performance here than H100, likely because the H100 submissions sharded the model across more GPUs for tensor parallelism while the H200 runs used fewer, shifting where the communication and capacity bottlenecks land. The toy model after this list illustrates how sensitive per-accelerator figures are to the parallelism degree.
  • Blackwell Success: B200 yields a 2x generational leap over H100 for this ultra-large model.
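
One plausible mechanism behind that inversion can be shown with a toy model: sharding across more GPUs pools more HBM capacity (allowing a larger batch) and more aggregate bandwidth, at the cost of a wider all-reduce. Every constant below is invented for illustration and is not calibrated to the MLPerf systems above:

```python
# Toy model of per-GPU decode throughput under tensor parallelism (TP).
def per_gpu_tokens(n_gpus, mem_gb, bw_tbps, weights_gb=810.0,
                   kv_gb_per_seq=2.0, comm_us_per_token=150.0):
    free_gb = n_gpus * mem_gb - weights_gb          # pooled memory left for KV cache
    batch = max(1, int(free_gb / kv_gb_per_seq))    # concurrent sequences that fit
    read_s = (weights_gb / 1000) / (n_gpus * bw_tbps)  # one pass over the weight shards
    comm_s = comm_us_per_token * 1e-6 * n_gpus / 8     # all-reduce cost grows with TP width
    return batch / (read_s + comm_s) / n_gpus          # tokens/s, normalized per GPU

# 16x H100 pools ~1280 GB vs ~1128 GB for 8x H200, so it sustains a larger
# batch per weight pass, which can outweigh H200's per-GPU bandwidth edge.
print(f"16x H100: {per_gpu_tokens(16, 80, 3.35):.0f} tok/s/GPU")
print(f" 8x H200: {per_gpu_tokens(8, 141, 4.8):.0f} tok/s/GPU")
```

With these made-up constants the 16-GPU H100 setup edges out the 8-GPU H200 setup per GPU, mirroring the inversion in the table; change any constant and the ordering can flip.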

5. Conclusion

  • Generational evolution is the key driver: Blackwell (B200, GB200) sets a new benchmark for AI performance.
  • The "Bucket Effect": Compute, Memory, and Interconnects:
    • For large models, memory bandwidth is often the primary bottleneck.
    • When models exceed single-node capacity, high-speed interconnects (NVLink) and topology become critical, as seen in GB200's results.
  • Product Positioning: Data center flagships (B200, H200 SXM) are the choices for cutting-edge AI, while PCIe versions suit mid-scale tasks.
  • Virtualization is Mature: GPU virtualization offers minimal overhead for AI inference deployment.

Choosing AI hardware is a multi-dimensional decision. This report aims to support those decisions with quantitative data from MLCommons benchmarks.
