The Uncertainty Behind Massive Compute Investment: The Deep Coupling of Model Architecture and Hardware Architecture (Based on a Gemini-2.5 Analysis)


Computing power demand is a complex issue. For AI training and the pursuit of AGI or ASI, one can never have too much. Within my own framework, however, I have recently been focusing on a more granular usage-cost model.

Specifically, I aim to answer: under different model architectures, as hardware improves, where exactly does performance increase, and where are the critical bottlenecks? This bears directly on commercial deployment and on the direction of future evolution.

After simplifying as much as possible, I went back to MLPerf inference results from v3.1 through v5.1. The hardware includes the A100, H100, H200, B200, and GB200 NVL-72, as well as the L40S and RTX PRO 6000. The models covered are GPT-J, Llama-2-70B, Llama-3.1-405B, DeepSeek-R1, and Stable Diffusion XL.

I reviewed this data manually, but for a more complete analysis and for charting I used two setups: GPT-5-Pro with GPT-Agent, and Gemini-2.5-Pro in AI-Studio.

GPT-5-Pro actually performed quite well: it analyzed the data, drew charts, and even built a visualization website for me. But after seeing the output from Gemini-2.5-Pro, I chose Gemini without hesitation. Two reasons: first, Gemini's understanding of GPU counts was clearly more accurate, correctly identifying that a GB200 NVL-72 rack contains 72 GPUs (18 compute trays × 4 GPUs each); second, its causal analysis was thorough and comprehensive.

I'll start with my own conclusions:

  • Currently, the coupling between hardware architecture and model architecture is too tight. Compare GB200 NVL-72 with DGX B200: for the MoE-architecture DeepSeek-R1, the performance gap is nearly tenfold, but for the dense Llama-405B it narrows to about 4-5x. This means that if models were to swing back to larger dense structures, the NVL-72 would become very uneconomical. Of course, you could argue this cluster is best for training, but then I must continue to "accuse" NVIDIA of "false advertising," since their latest claim is that investing in NVL-72 can yield a 15x return through inference.
  • If we believe the next breakout point is multimodality, then SD-XL's results, even though it is only a first-generation text-to-image model, stand in for image- and video-generation workloads: B200 offers less than a 2x performance boost over H200. Don't forget that B200 packages two dies; in theory, ignoring HBM bandwidth, it should deliver roughly 2.5x the performance of a single Hopper die. The gains Blackwell brings to multimodality are limited.
  • HBM bandwidth's impact on inference performance is direct: for models that fit on a single card, H200 shows over a 30% improvement over H100 (4.8 TB/s vs. 3.35 TB/s memory bandwidth). See the back-of-envelope sketch after this list.
  • At this point in time, the probability that model architectures change significantly over the next 2-3 years is very high, which means hardware architectures will change with them. This uncertainty is a massive challenge for model companies and CSP operators.
  • Forgive me for not being more blunt.
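
To make the arithmetic behind these bullets explicit, here is a back-of-envelope sketch in Python. Every number is taken from this post or from the report tables below; treat it as a sanity check, not a performance model.

```python
# (1) HBM bandwidth ceiling: H200 vs. H100-SXM, in TB/s. Memory-bound decoding
# scales at most with bandwidth, so ~1.43x is the ceiling for the H200 uplift;
# the observed >30% gain on single-card models sits just under it.
h100_bw_tbps, h200_bw_tbps = 3.35, 4.8
print(f"H200/H100 bandwidth ratio: {h200_bw_tbps / h100_bw_tbps:.2f}x")

# (2) Blackwell on SD-XL: per-accelerator offline results from the report.
# Two dies suggest ~2-2.5x a single Hopper die in theory, yet the observed
# uplift over H200 stays under 2x.
sdxl_b200_per_gpu, sdxl_h200_per_gpu = 3.95, 2.30
print(f"SD-XL B200/H200 per GPU: {sdxl_b200_per_gpu / sdxl_h200_per_gpu:.2f}x")

# (3) MoE vs. dense gap, GB200 NVL-72 vs. an 8x B200 system (offline totals):
r1_gap = 289712.00 / 31486.55      # DeepSeek-R1 (MoE): ~9.2x, "nearly tenfold"
dense_gap = 6233.13 / 1613.16      # Llama-3.1-405B (dense): ~3.9x, "about 4-5x"
print(f"R1 gap: {r1_gap:.1f}x, Llama-405B gap: {dense_gap:.1f}x")
```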

Before presenting the Gemini report, here are a few screenshots from GPT-5-Pro.

GPT-5-Pro Analysis 2

Next is the complete report provided by Gemini. A few notes first. In its response, the report text and images were separated by placeholders, so I pasted the images in manually. Someone will surely ask how to set this up: in AI-Studio, enable "Code Execution" and "Grounding" as shown in the settings screenshot below. Recently, people have asked about Gemini's "downgraded intelligence" (model degradation); my answer is that I haven't noticed it, and it likely depends on the prompts and usage patterns. Finally, the Matplotlib charts Gemini generated contain some garbled text due to Chinese font support: anyone familiar with Python knows the fix is to switch to a font with CJK coverage, but Gemini's code runs in a sandbox whose system fonts cannot be changed.
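
For reference, a minimal sketch of the usual workaround when you do control the environment (the font names below are examples; use whichever CJK-capable font your system actually provides):

```python
import matplotlib.pyplot as plt
from matplotlib import font_manager

# Pick the first CJK-capable font that is actually installed.
available = {f.name for f in font_manager.fontManager.ttflist}
for candidate in ["Noto Sans CJK SC", "Microsoft YaHei", "SimHei", "PingFang SC"]:
    if candidate in available:
        plt.rcParams["font.sans-serif"] = [candidate]
        break

plt.rcParams["axes.unicode_minus"] = False  # avoid broken minus signs with CJK fonts

plt.plot([3350, 4800, 8000])
plt.title("显存带宽对比")  # renders correctly once a CJK font is registered
plt.ylabel("HBM bandwidth (GB/s)")
plt.show()
```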

Also, there are a few small bugs in the report text; let's see whether any sharp-eyed readers can spot them.

AI-Studio Settings

AI Accelerator Performance Benchmark Comprehensive Analysis Report (Advanced Detailed Version)

1. Introduction

This report is based on public benchmark results from MLCommons® Inference v3.1, v4.0, v4.1, v5.0, and v5.1. It provides a comprehensive and in-depth performance evaluation of current mainstream AI accelerators (including NVIDIA GPUs and Google TPUs). The report aims to provide a detailed, quantitative performance reference for AI infrastructure decision-makers, researchers, and engineers.

Analysis dimensions include:

  • Accelerator Dimension: Evaluates multiple NVIDIA GPUs from the Ampere architecture to the latest Blackwell architecture, as well as Google's TPU v5e.
  • Model Dimension: Covers five AI models ranging from widely used industry standards to cutting-edge trends, including GPTJ-99, Llama2-70b-99.9, Stable-Diffusion-XL, DeepSeek-R1, and Llama3.1-405b.
  • Scenario Dimension: Distinguishes between "Offline" scenarios measuring maximum throughput and "Server" scenarios simulating real-world service.

The core of the report provides comparisons of raw performance data and standardized "performance per accelerator," combined with core technical specifications of the hardware.

2. Data and Methodology

2.1. Data Sourcing and Integration

This report integrates MLCommons Inference test result files from five different versions. All data were merged into a unified dataset and underwent rigorous cleaning processes, including unifying column names, removing redundant data, and handling missing values.
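
As an illustration of what this merge and cleanup typically looks like, here is a hedged pandas sketch. The file names and column mapping are hypothetical placeholders, not the report's actual code:

```python
import pandas as pd

# Hypothetical per-round result exports; real MLCommons files differ by round.
files = {
    "v3.1": "inference_v3.1.csv",
    "v4.0": "inference_v4.0.csv",
    "v4.1": "inference_v4.1.csv",
    "v5.0": "inference_v5.0.csv",
    "v5.1": "inference_v5.1.csv",
}

frames = []
for version, path in files.items():
    df = pd.read_csv(path)
    # Unify column names that drift between rounds (mapping is illustrative).
    df = df.rename(columns={"Avg. Result": "Result",
                            "# of Accelerators": "accelerators_per_node"})
    df["mlperf_version"] = version
    frames.append(df)

merged = pd.concat(frames, ignore_index=True)
merged = merged.drop_duplicates()                         # remove redundant rows
merged = merged.dropna(subset=["Accelerator", "Result"])  # handle missing values
```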

2.2. Core Calculation Metrics

To ensure fairness and transparency in comparison, we defined the following core metrics:

  • Total Accelerators: The total number of chips used in the testing platform (Number of nodes × Accelerators per node).
  • Result per Accelerator: This is the most important standardized metric in this report, calculated as Benchmark Average Result / Total Accelerators. It removes the influence of cluster size to measure the true inference capability of a single chip (see the sketch after this list).
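
In code, the two metrics reduce to two derived columns. This continues the hypothetical merged dataset from the sketch above; the column names are assumptions:

```python
# Total Accelerators = nodes x accelerators per node.
merged["total_accelerators"] = merged["nodes"] * merged["accelerators_per_node"]

# Result per Accelerator strips out cluster size.
merged["result_per_accelerator"] = merged["Result"] / merged["total_accelerators"]

# Averaging across submissions of the same chip and scenario yields the
# "Avg Result per Accelerator" figures shown in the tables below.
per_chip = (
    merged.groupby(["Accelerator", "Scenario"])["result_per_accelerator"]
          .mean()
          .reset_index()
)
```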

3. Core Technical Specifications of Accelerators

Understanding the theoretical limits of hardware is the foundation for analyzing performance. The table below summarizes the core technical specifications of the accelerators involved in the report:

| Accelerator | Architecture | FP16/BF16 Compute (TFLOPS) | Memory Type | Memory Bandwidth (GB/s) | TDP (W, Approx.) |
|---|---|---|---|---|---|
| NVIDIA A100-PCIe-80GB | Ampere | 624 | HBM2e | 1935 | 300 |
| NVIDIA A100-SXM-80GB | Ampere | 624 | HBM2e | 2039 | 400 |
| NVIDIA L40S | Ada Lovelace | 733 | GDDR6 | 864 | 350 |
| NVIDIA H100-PCIe-80GB | Hopper | 1513 | HBM3 | 2000 | 350 |
| NVIDIA H100-SXM-80GB | Hopper | 1979 | HBM3 | 3350 | 700 |
| NVIDIA H100 NVL | Hopper | 3342 | HBM3 | 7800 | 800 |
| NVIDIA H200-SXM-141GB | Hopper | 1979 | HBM3e | 4800 | 700 |
| Virtualized H200-SXM-141GB | Hopper | 1979 | HBM3e | 4800 | 700 |
| NVIDIA H200-NVL-141GB | Hopper | 3958 | HBM3e | 9600 | 1200 |
| NVIDIA GH200 Superchip | Hopper | 1979 | LPDDR5X | 4800 | 1000 |
| NVIDIA RTX 6000 Blackwell | Blackwell | 250 | GDDR7 | 1792 | 600 |
| NVIDIA B200-SXM-180GB | Blackwell | 2500 | HBM3e | 8000 | 1000 |
| NVIDIA GB200 Superchip | Blackwell | 2500 | HBM3e | 8000 | 1000 |
| Google TPU v5e | TPU v5 | 197 | - | 819 | Unknown |

4. Detailed Performance Analysis by Model

4.1. GPTJ-99 Model

GPTJ-99 is a medium-scale (6 billion parameter) language model used to evaluate the baseline performance of accelerators for standard NLP tasks.

| Accelerator | Scenario | Avg Total Result | Avg Result per Accelerator | Avg Total Accelerators |
|---|---|---|---|---|
| NVIDIA A100-PCIe-80GB | Offline | 14.70 | 3.68 | 4.0 |
| NVIDIA A100-PCIe-80GB | Server | 13.81 | 3.45 | 4.0 |
| NVIDIA A100-SXM-80GB | Offline | 27.13 | 3.39 | 8.0 |
| NVIDIA A100-SXM-80GB | Server | 16.92 | 2.12 | 8.0 |
| NVIDIA GH200 Superchip | Offline | 26.00 | 26.00 | 1.0 |
| NVIDIA GH200 Superchip | Server | 24.62 | 24.62 | 1.0 |
| NVIDIA H100 NVL | Offline | 43.56 | 21.78 | 2.0 |
| NVIDIA H100 NVL | Server | 42.07 | 21.04 | 2.0 |
| NVIDIA H100-PCIe-80GB | Offline | 4352.75 | 1089.08 | 4.7 |
| NVIDIA H100-PCIe-80GB | Server | 3395.97 | 932.53 | 4.7 |
| NVIDIA H100-SXM-80GB | Offline | 7470.24 | 1001.95 | 6.8 |
| NVIDIA H100-SXM-80GB | Server | 7353.27 | 985.96 | 6.8 |
| NVIDIA H200-NVL-141GB | Offline | 17905.15 | 2238.14 | 8.0 |
| NVIDIA H200-NVL-141GB | Server | 18141.45 | 2267.68 | 8.0 |
| NVIDIA H200-SXM-141GB | Offline | 18090.81 | 2412.90 | 7.3 |
| NVIDIA H200-SXM-141GB | Server | 18045.91 | 2399.25 | 7.3 |
| NVIDIA L40S | Offline | 2275.52 | 525.25 | 4.9 |
| NVIDIA L40S | Server | 2190.09 | 508.65 | 4.9 |
| Google TPU v5e | Offline | 9.98 | 2.50 | 4.0 |
| Google TPU v5e | Server | 7.19 | 1.80 | 4.0 |

Note: Units vary between test versions (Samples/s vs Tokens/s); values are compared directly here.

Offline Scenario: GPTJ Offline Chart

Server Scenario: GPTJ Server Chart

Analysis Insights:

  • H200 Series Leadership: In both scenarios, the H200 series leads single-card performance, thanks to its 4.8 TB/s of HBM3e bandwidth.
  • SXM Advantage: H100-SXM performs significantly better than H100-PCIe, reflecting the benefits of higher power and NVLink.
  • TPU v5e Performance: Performance is close to the A100 series but trails significantly behind Hopper and newer architectures.

4.2. Llama2-70b-99.9 Model

Llama2-70b is a core benchmark for evaluating the ability of modern accelerators to handle mainstream large language models.

| Accelerator | Scenario | Avg Total (Tokens/s) | Result per Accelerator (Tokens/s) | Avg Accelerators |
|---|---|---|---|---|
| NVIDIA B200-SXM-180GB | Offline | 88741.00 | 12303.93 | 7.2 |
| NVIDIA B200-SXM-180GB | Server | 88345.62 | 12218.53 | 7.2 |
| NVIDIA GB200 | Offline | 50710.20 | 12677.55 | 4.0 |
| NVIDIA GB200 | Server | 49287.75 | 12321.94 | 4.0 |
| NVIDIA GH200 Superchip | Offline | 3871.47 | 3871.47 | 1.0 |
| NVIDIA GH200 Superchip | Server | 3616.88 | 3616.88 | 1.0 |
| NVIDIA H100 NVL | Offline | 9493.01 | 1917.94 | 5.0 |
| NVIDIA H100 NVL | Server | 8866.93 | 1756.92 | 5.0 |
| NVIDIA H100-PCIe-80GB | Offline | 6759.53 | 1399.84 | 5.0 |
| NVIDIA H100-PCIe-80GB | Server | 5697.54 | 1171.44 | 5.0 |
| NVIDIA H100-SXM-80GB | Offline | 27825.32 | 3748.26 | 9.0 |
| NVIDIA H100-SXM-80GB | Server | 26364.50 | 3544.65 | 9.0 |
| NVIDIA H200-NVL-141GB | Offline | 27689.65 | 3777.19 | 7.3 |
| NVIDIA H200-NVL-141GB | Server | 25210.00 | 3437.31 | 7.3 |
| NVIDIA H200-SXM-141GB | Offline | 34497.33 | 4379.04 | 8.1 |
| NVIDIA H200-SXM-141GB | Server | 32453.00 | 4113.87 | 8.1 |
| NVIDIA L40S | Offline | 3143.23 | 446.58 | 7.0 |
| NVIDIA L40S | Server | 2767.63 | 391.87 | 7.0 |
| NVIDIA RTX PRO 6000 | Offline | 26205.30 | 3275.66 | 8.0 |
| NVIDIA RTX PRO 6000 | Server | 26001.04 | 3250.13 | 8.0 |
| Virtualized NVIDIA H200-SXM | Offline | 34485.70 | 4310.71 | 8.0 |
| Virtualized NVIDIA H200-SXM | Server | 33370.58 | 4171.32 | 8.0 |

Offline Scenario: Llama2 Offline Chart

Server Scenario: Llama2 Server Chart

Analysis Insights:

  • Blackwell Dominance: B200 and GB200 deliver roughly 3x the per-accelerator performance of the H200 series, reflecting deep Transformer optimization and 8000 GB/s of HBM3e bandwidth.
  • Virtualization Efficiency: Virtualized H200 performance is nearly identical to physical hardware.
  • L40S Positioning: Significantly lower performance on LLM inference compared to dedicated data center cards.

4.3. Stable-Diffusion-XL Model

SD-XL is an advanced text-to-image model demanding high computation and memory access.

| Accelerator | Scenario | Avg Total (Samples/s) | Result per Accelerator | Avg Accelerators |
|---|---|---|---|---|
| NVIDIA B200-SXM-180GB | Offline | 29.24 | 3.95 | 7.4 |
| NVIDIA B200-SXM-180GB | Server | 26.44 | 3.56 | 7.4 |
| NVIDIA GH200 Superchip | Offline | 1.78 | 1.78 | 1.0 |
| NVIDIA GH200 Superchip | Server | 1.68 | 1.68 | 1.0 |
| NVIDIA H100-PCIe-80GB | Offline | 5.94 | 1.20 | 5.0 |
| NVIDIA H100-PCIe-80GB | Server | 5.04 | 1.02 | 5.0 |
| NVIDIA H100-SXM-80GB | Offline | 13.12 | 1.89 | 6.9 |
| NVIDIA H100-SXM-80GB | Server | 12.67 | 1.82 | 6.9 |
| NVIDIA H200-NVL-141GB | Offline | 14.93 | 2.07 | 7.2 |
| NVIDIA H200-NVL-141GB | Server | 14.05 | 1.95 | 7.2 |
| NVIDIA H200-SXM-141GB | Offline | 17.86 | 2.30 | 7.8 |
| NVIDIA H200-SXM-141GB | Server | 16.54 | 2.12 | 7.8 |
| NVIDIA L40S | Offline | 3.80 | 0.67 | 5.7 |
| NVIDIA L40S | Server | 3.61 | 0.62 | 5.7 |
| NVIDIA RTX PRO 6000 | Offline | 11.08 | 1.38 | 8.0 |
| NVIDIA RTX PRO 6000 | Server | 10.88 | 1.36 | 8.0 |
| Google TPU v5e | Offline | 1.75 | 0.44 | 4.0 |
| Google TPU v5e | Server | 1.55 | 0.39 | 4.0 |
| Virtualized NVIDIA H200-SXM | Offline | 18.64 | 2.33 | 8.0 |
| Virtualized NVIDIA H200-SXM | Server | 17.95 | 2.24 | 8.0 |

Offline Scenario: SDXL Offline Chart

Server Scenario: SDXL Server Chart

Analysis Insights:

  • B200 Doubling Performance: Per-accelerator performance is roughly 2x that of H100-SXM.
  • Low Virtualization Overhead: Virtualized H200 performed excellently, even slightly higher than physical H200 in offline tests.
  • GH200 Performance: Comparable to H100-SXM, demonstrating the strength of the Hopper core.

4.4. DeepSeek-R1 Model

DeepSeek-R1 is an emerging large model; the results here come from large-scale cluster configurations and demonstrate scaling behavior.

| Accelerator | Scenario | Avg Total Result (Samples/s) | Result per Accelerator | Avg Accelerators |
|---|---|---|---|---|
| NVIDIA B200-SXM-180GB | Offline | 31486.55 | 3935.82 | 8.0 |
| NVIDIA B200-SXM-180GB | Server | 15415.43 | 1926.93 | 8.0 |
| NVIDIA GB200 | Offline | 289712.00 | 72428.00 | 72.0 |
| NVIDIA GB200 | Server | 167578.00 | 41894.50 | 72.0 |

Offline Scenario: Deepseek Offline Chart

Server Scenario: Deepseek Server Chart

Analysis Insights:

  • Scale Effects: GB200's performance per accelerator is far higher than B200, revealing the massive non-linear performance gains from the NVL72 cluster and high-speed interconnects.
  • Server Scenario Challenges: Performance drops significantly in real-time server scenarios, reflecting the difficulty of maintaining utilization under low-latency constraints.

4.5. Llama3.1-405b Model

With its 405B parameters, Llama3.1-405b is the ultimate test of AI hardware compute and memory capacity.

| Accelerator | Scenario | Avg Total Result (Samples/s) | Result per Accelerator | Avg Accelerators |
|---|---|---|---|---|
| NVIDIA B200-SXM-180GB | Offline | 1613.16 | 201.65 | 8.0 |
| NVIDIA B200-SXM-180GB | Server | 1179.05 | 147.38 | 8.0 |
| NVIDIA GB200 | Offline | 6233.13 | 1558.28 | 31.2 |
| NVIDIA GB200 | Server | 4433.33 | 1108.33 | 31.2 |
| NVIDIA H100-SXM-80GB | Offline | 794.56 | 99.32 | 16.0 |
| NVIDIA H100-SXM-80GB | Server | 557.35 | 69.67 | 16.0 |
| NVIDIA H200-SXM-141GB | Offline | 610.17 | 76.27 | 8.8 |
| NVIDIA H200-SXM-141GB | Server | 319.86 | 39.98 | 8.8 |
| Virtualized NVIDIA H200-SXM | Offline | 547.26 | 68.41 | 8.0 |
| Virtualized NVIDIA H200-SXM | Server | 277.33 | 34.67 | 8.0 |

Offline Scenario: Llama3.1 Offline Chart

Server Scenario: Llama3.1 Server Chart

Analysis Insights:

  • GB200 Built for Giants: GB200 shows its power in handling ultra-large models through massive cluster scaling.
  • Memory Bottlenecks: Interestingly, H200 shows lower per-accelerator performance here than H100, likely because the H100 submissions sharded the model across more GPUs for tensor parallelism while the H200 runs used fewer, shifting where the communication and capacity bottlenecks land. The toy model after this list illustrates how sensitive per-accelerator figures are to the parallelism degree.
  • Blackwell Success: B200 yields a 2x generational leap over H100 for this ultra-large model.
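
One plausible mechanism behind that inversion can be shown with a toy model: sharding across more GPUs pools more HBM capacity (allowing a larger batch) and more aggregate bandwidth, at the cost of a wider all-reduce. Every constant below is invented for illustration and is not calibrated to the MLPerf systems above:

```python
# Toy model of per-GPU decode throughput under tensor parallelism (TP).
def per_gpu_tokens(n_gpus, mem_gb, bw_tbps, weights_gb=810.0,
                   kv_gb_per_seq=2.0, comm_us_per_token=150.0):
    free_gb = n_gpus * mem_gb - weights_gb          # pooled memory left for KV cache
    batch = max(1, int(free_gb / kv_gb_per_seq))    # concurrent sequences that fit
    read_s = (weights_gb / 1000) / (n_gpus * bw_tbps)  # one pass over the weight shards
    comm_s = comm_us_per_token * 1e-6 * n_gpus / 8     # all-reduce cost grows with TP width
    return batch / (read_s + comm_s) / n_gpus          # tokens/s, normalized per GPU

# 16x H100 pools ~1280 GB vs ~1128 GB for 8x H200, so it sustains a larger
# batch per weight pass, which can outweigh H200's per-GPU bandwidth edge.
print(f"16x H100: {per_gpu_tokens(16, 80, 3.35):.0f} tok/s/GPU")
print(f" 8x H200: {per_gpu_tokens(8, 141, 4.8):.0f} tok/s/GPU")
```

With these made-up constants the 16-GPU H100 setup edges out the 8-GPU H200 setup per GPU, mirroring the inversion in the table; change any constant and the ordering can flip.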

5. Conclusion

  • Generational evolution is the key driver: Blackwell (B200, GB200) sets a new benchmark for AI performance.
  • The "Bucket Effect": Compute, Memory, and Interconnects:
    • For large models, memory bandwidth is often the primary bottleneck.
    • When models exceed single-node capacity, high-speed interconnects (NVLink) and topology become critical, as seen in GB200's results.
  • Product Positioning: Data center flagships (B200, H200 SXM) are the choices for cutting-edge AI, while PCIe versions suit mid-scale tasks.
  • Virtualization is Mature: GPU virtualization offers minimal overhead for AI inference deployment.

Choosing AI hardware is a multi-dimensional decision. This report aims to support those decisions with quantitative data from MLCommons benchmarks.
