Frontier AI Training Hardware in 2026: H100, H200, GB200 NVL72, TPU Ironwood, Trainium2/3


TL;DR

  • NVIDIA ships the most-deployed training chips: H100 SXM (700 W, 3.35 TB/s HBM3, 989 TFLOPS BF16 dense / 1,979 TFLOPS FP8 dense), H200 (141 GB HBM3e, 4.8 TB/s, same compute), and Blackwell B200 / GB200 NVL72 (4.5 PFLOPS FP8 dense per HGX B200 GPU; 5 PFLOPS dense / 10 PFLOPS sparse per GB200 GPU; 1.44 EFLOPS FP4 sparse per NVL72 rack).
  • Google's Ironwood (TPU v7) delivers 4,614 TFLOPS FP8 per chip, 192 GB HBM3e, and 42.5 EFLOPS FP8 per 9,216-chip pod. Per chip, that is within about 8% of a GB200 GPU's 5 PFLOPS dense FP8.
  • AWS Trainium2 delivers 1.3 PFLOPS FP8 dense per chip; a Trn2 UltraServer is 64 chips, 83.2 PFLOPS, and 6 TB HBM. Trainium3 (3 nm, 2026) roughly doubles per-chip compute to 2.52 PFLOPS FP8 (MXFP8), with 144 GB HBM3e.
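The TL;DR mixes BF16, FP8, and FP4 figures as well as dense and sparse ones. Under NVIDIA's usual headline conventions (halving precision doubles tensor-core throughput; 2:4 structured sparsity doubles the quoted rate again), those numbers all follow from each chip's dense FP8 rate. A minimal sketch, assuming those two rules of thumb; `quoted_tflops` and `PRECISION_FACTOR` are illustrative names, not a vendor API.

```python
# Rule-of-thumb scaling behind the headline numbers (an assumption, not a
# vendor formula): halving precision doubles tensor-core throughput, and
# 2:4 structured sparsity doubles the quoted rate again.
PRECISION_FACTOR = {"bf16": 0.5, "fp8": 1.0, "fp4": 2.0}


def quoted_tflops(fp8_dense_tflops: float, precision: str, sparse: bool = False) -> float:
    """Derive a quoted peak rate from a chip's dense FP8 TFLOPS."""
    rate = fp8_dense_tflops * PRECISION_FACTOR[precision]
    return rate * 2 if sparse else rate


# H100 SXM: 1,979 dense FP8 TFLOPS -> ~989 dense BF16 TFLOPS.
h100_bf16_dense = quoted_tflops(1_979, "bf16")

# GB200 GPU: 5,000 dense FP8 TFLOPS -> 10,000 sparse FP8 TFLOPS.
gb200_fp8_sparse = quoted_tflops(5_000, "fp8", sparse=True)

# GB200 NVL72 rack: 72 GPUs at sparse FP4, converted from TFLOPS to EFLOPS.
nvl72_fp4_sparse_eflops = 72 * quoted_tflops(5_000, "fp4", sparse=True) / 1e6

print(f"H100 BF16 dense:  {h100_bf16_dense:,.1f} TFLOPS")
print(f"GB200 FP8 sparse: {gb200_fp8_sparse:,.0f} TFLOPS")
print(f"NVL72 FP4 sparse: {nvl72_fp4_sparse_eflops:.2f} EFLOPS")
```

Hopper has no FP4 tensor cores, so the `fp4` factor only applies to Blackwell parts.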

Master comparison table (primary sources)

| Chip | Peak dense FP8 (TFLOPS) | HBM | HBM bandwidth | Scale-up interconnect | TDP |
|---|---|---|---|---|---|
| H100 SXM5 | 1,979 | 80 GB HBM3 | 3.35 TB/s | NVLink 4: 900 GB/s | 700 W |
| H200 SXM | 1,979 | 141 GB HBM3e | 4.8 TB/s | NVLink 4: 900 GB/s | 700 W |
| B200 (HGX) | 4,500 | 180 GB HBM3e | 7.7 TB/s | NVLink 5: 1.8 TB/s | ~1,000 W |
| GB200 (per GPU) | 5,000 | 186 GB HBM3e | 8 TB/s | NVLink 5: 1.8 TB/s | up to 1,200 W |
| TPU v5p | 459 (BF16) | 95 GB HBM2e | 2.76 TB/s | ICI 4,800 Gbps/chip; 8,960-chip pod | not listed |
| TPU v6e Trillium | ~918 (BF16) | 32 GB HBM | ~1.6 TB/s | ICI; 256-chip pod | not listed |
| TPU v7 Ironwood | 4,614 | 192 GB HBM3e | ~7.37 TB/s | ICI; 9,216-chip pod | ~1 kW (est.) |
| Trainium2 | 1,300 | 96 GB HBM | ~2.9 TB/s | NeuronLink; 64-chip UltraServer | not listed |
| Trainium3 (3 nm) | 2,520 (MXFP8) | 144 GB HBM3e | 4.9 TB/s | NeuronLink v4; 144-chip UltraServer | not listed |
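One way to read the master table is through the roofline "ridge point": peak dense FLOPs divided by HBM bandwidth gives the arithmetic intensity (FLOPs per byte read from HBM) a kernel needs before it becomes compute-bound rather than memory-bound. A minimal sketch using only the table's FP8 rows; the `Chip` class and `ridge_point` function are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    fp8_dense_tflops: float  # peak dense FP8 (MXFP8 for Trainium3)
    hbm_tb_per_s: float      # HBM bandwidth per chip


# Values copied from the master table (FP8 rows only; TPU v5p/v6e list BF16).
CHIPS = [
    Chip("H100 SXM5", 1_979, 3.35),
    Chip("H200 SXM", 1_979, 4.8),
    Chip("B200 (HGX)", 4_500, 7.7),
    Chip("GB200 GPU", 5_000, 8.0),
    Chip("TPU v7 Ironwood", 4_614, 7.37),
    Chip("Trainium2", 1_300, 2.9),
    Chip("Trainium3", 2_520, 4.9),
]


def ridge_point(chip: Chip) -> float:
    """FLOPs per HBM byte at which a kernel stops being bandwidth-bound.

    TFLOPS / (TB/s) cancels the 1e12 factors, leaving FLOP/byte.
    """
    return chip.fp8_dense_tflops / chip.hbm_tb_per_s


for chip in sorted(CHIPS, key=ridge_point):
    print(f"{chip.name:<16} {ridge_point(chip):6.0f} FLOP/byte")
```

H200 has the same compute as H100 but about 43% more bandwidth, so its ridge point drops from roughly 590 to roughly 412 FLOP/byte: bandwidth-bound kernels speed up, while large compute-bound GEMMs do not.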

Pod / rack totals

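| System | Aggregate dense FP8 | Aggregate HBM | Aggregate HBM bandwidth | Scale-up fabric |
|---|---|---|---|---|
| GB200 NVL72 | 72 × 5 PF = 360 PFLOPS (720 PF sparse) | ~13.4 TB HBM3e | 576 TB/s | NVLink 5 switch domain, 130 TB/s |
| TPU v7 Ironwood pod | ~42.5 EFLOPS FP8 | 9,216 × 192 GB ≈ 1.77 PB HBM3e | ≈ 67.9 PB/s (9,216 × 7.37 TB/s) | ICI, 3D torus |
| Trn2 UltraServer | 64 × 1.3 PF = 83.2 PFLOPS FP8 | 6 TB HBM | 185 TB/s | NeuronLink |
| Trn3 UltraServer | 144 × 2.52 PF ≈ 362 PFLOPS MXFP8 | 20.7 TB HBM3e | 706 TB/s | NeuronLink v4 |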
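Every pod or rack total is chip count times a per-chip figure from the master table, so the totals can be re-derived and checked. A minimal sketch in decimal units; `SYSTEMS` and `totals` are illustrative names, and the per-chip inputs are copied from the master table.

```python
# Per-system inputs copied from the master table:
# (chip count, dense FP8 TFLOPS per chip, HBM GB per chip, HBM TB/s per chip)
SYSTEMS = {
    "GB200 NVL72": (72, 5_000, 186, 8.0),
    "TPU v7 Ironwood pod": (9_216, 4_614, 192, 7.37),
    "Trn2 UltraServer": (64, 1_300, 96, 2.9),
    "Trn3 UltraServer": (144, 2_520, 144, 4.9),
}


def totals(chips: int, tflops: float, hbm_gb: float, hbm_tb_s: float) -> dict:
    """Scale per-chip figures to a whole rack or pod (decimal units)."""
    return {
        "pflops": chips * tflops / 1e3,   # TFLOPS -> PFLOPS
        "hbm_tb": chips * hbm_gb / 1e3,   # GB -> TB
        "hbm_bw_tb_s": chips * hbm_tb_s,  # already TB/s
    }


for name, spec in SYSTEMS.items():
    t = totals(*spec)
    print(f"{name:<20} {t['pflops']:>10,.1f} PFLOPS "
          f"{t['hbm_tb']:>9,.1f} TB {t['hbm_bw_tb_s']:>9,.1f} TB/s")

# NVLink check: 72 GPUs x 1.8 TB/s NVLink 5 per GPU vs. NVIDIA's quoted 130 TB/s.
nvl72_nvlink_tb_s = 72 * 1.8
print(f"NVL72 NVLink aggregate: {nvl72_nvlink_tb_s:.1f} TB/s")
```

Ironwood's HBM total comes out at about 1,769 TB (1.77 PB), and the Trainium rows land on 185.6 and 705.6 TB/s, which matches AWS's 185 and 706 TB/s totals. Those two figures are therefore aggregate HBM bandwidth, not NeuronLink bandwidth.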

Sources (verbatim quotes)

  • H200: "the NVIDIA H200 is the first GPU to offer 141 gigabytes (GB) of HBM3e memory at 4.8 terabytes per second (TB/s)" (nvidia.com/en-us/data-center/h200/).
  • GB200 NVL72: "72 NVIDIA Blackwell GPUs interconnected by the largest NVIDIA NVLink domain ever offered, NVLink Switch System provides 130 terabytes per second (TB/s) of low-latency GPU communications" (nvidia.com/en-us/data-center/gb200-nvl72/).
  • TPU v5p: "Each TPU v5p pod composes together 8,960 chips over our highest-bandwidth inter-chip interconnect (ICI) at 4,800 Gbps/chip in a 3D torus topology" (cloud.google.com).
  • TPU Ironwood: "TPU7x is the first release within the Ironwood family… With a 9,216-chip footprint per Pod… Each chip is equipped with 192 GB of HBM, with bandwidth of approximately 7.37 TB/s" (docs.cloud.google.com/tpu/docs/tpu7x).
  • Trainium2: "Trn2 instances feature 16 Trainium2 chips… up to 20.8 FP8 petaflops of compute. Trn2 UltraServers extend NeuronLink connectivity to 64 Trainium2 chips… up to 83.2 FP8 petaflops of compute" (aws.amazon.com/ec2/instance-types/trn2/).
  • Trainium3: "AWS Trainium3 chip provides 2x higher compute performance to 2.52 petaflops (PFLOPs) of FP8 compute, increases the memory capacity by 1.5x and bandwidth by 1.7x over Trainium2 to 144 GB of HBM3e memory, and 4.9 TB/s of memory bandwidth" (aws.amazon.com/ai/machine-learning/trainium/).
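The ratios and unit conversions in the quotes above can be checked directly. A short sketch using per-chip figures from the master table; the variable names are illustrative.

```python
# Per-chip figures from the master table.
TRN2 = {"fp8_pflops": 1.3, "hbm_gb": 96, "hbm_tb_s": 2.9}
TRN3 = {"fp8_pflops": 2.52, "hbm_gb": 144, "hbm_tb_s": 4.9}

# AWS quotes 2x compute, 1.5x capacity, and 1.7x bandwidth for Trainium3.
ratios = {key: TRN3[key] / TRN2[key] for key in TRN2}
for key, ratio in ratios.items():
    print(f"{key:<11} {ratio:.2f}x")

# Trn2 instance quote: 20.8 FP8 PFLOPS across 16 chips.
trn2_per_chip_pflops = 20.8 / 16

# TPU v5p quote: 4,800 Gbps of ICI per chip, converted to GB/s (8 bits per byte).
v5p_ici_gb_s = 4_800 / 8

print(f"Trn2 per chip: {trn2_per_chip_pflops:.2f} PFLOPS FP8")
print(f"TPU v5p ICI:   {v5p_ici_gb_s:.0f} GB/s per chip")
```

The compute ratio is 1.94×, which AWS rounds to 2×; capacity and bandwidth come out at 1.50× and 1.69×.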

FAQ

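Q: How does Ironwood compare to GB200? Per chip, Ironwood's ~4.6 PFLOPS FP8 is within about 8% of a GB200 GPU's 5 PFLOPS dense FP8, so the two are roughly at parity. At scale-up level the gap is large: a 9,216-chip Ironwood pod delivers 42.5 EFLOPS versus 360 PFLOPS dense for a 72-GPU NVL72 rack, i.e. 128× the chips and about 118× the dense FP8 compute in a single scale-up domain. The fabrics differ, though: NVL72 is an all-to-all NVLink switch domain, while an Ironwood pod is an ICI 3D torus.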
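The per-chip and scale-up comparisons in this answer reduce to three ratios. A minimal sketch using the per-chip and pod figures from the tables above; the constant names are illustrative.

```python
# Ironwood vs. GB200, using figures from the master and totals tables.
IRONWOOD_CHIP_PFLOPS = 4.614  # FP8 per chip
GB200_CHIP_PFLOPS = 5.0       # dense FP8 per GPU
IRONWOOD_POD_CHIPS = 9_216
NVL72_GPUS = 72

per_chip_ratio = IRONWOOD_CHIP_PFLOPS / GB200_CHIP_PFLOPS
chip_count_ratio = IRONWOOD_POD_CHIPS / NVL72_GPUS
pod_compute_ratio = (IRONWOOD_POD_CHIPS * IRONWOOD_CHIP_PFLOPS) / (
    NVL72_GPUS * GB200_CHIP_PFLOPS
)

print(f"Per chip: Ironwood = {per_chip_ratio:.0%} of GB200")
print(f"Scale-up domain: {chip_count_ratio:.0f}x the chips, "
      f"{pod_compute_ratio:.0f}x the dense FP8")
```

Because Ironwood trails slightly per chip, the compute ratio (about 118×) is a little below the chip-count ratio (128×).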

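Q: When does FP4 become production? Blackwell (B200/GB200) already supports it in hardware, and the NVL72's 1.44 EFLOPS headline figure is FP4 sparse. DeepSeek's V3 paper and the SemiAnalysis Trainium3 piece both point to FP4 as the next step.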

Q: Why ~1 kW chips? More HBM stacks and more tensor cores running at higher clocks push per-chip power toward 1 kW, and that power density is what is driving the wave of gigawatt-scale data centers.
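The ~1 kW figures are easier to judge as efficiency. A rough sketch that divides each chip's dense FP8 rate by the TDP in the master table; it ignores host CPUs, switches, and cooling, and treats Ironwood's ~1 kW as an approximation rather than a published spec. `CHIP_TDP` is an illustrative name.

```python
# Dense FP8 TFLOPS per watt of chip TDP, using the master table's values.
# Caveats: chip-level TDP only (no CPU, switch, or cooling power), and the
# Ironwood ~1 kW entry is approximate.
CHIP_TDP = {
    "H100 SXM5": (1_979, 700),
    "B200 (HGX)": (4_500, 1_000),
    "GB200 GPU": (5_000, 1_200),
    "TPU v7 Ironwood (~1 kW)": (4_614, 1_000),
}

efficiency = {name: tflops / watts for name, (tflops, watts) in CHIP_TDP.items()}

for name, tf_per_w in sorted(efficiency.items(), key=lambda kv: kv[1]):
    print(f"{name:<24} {tf_per_w:.2f} TFLOPS/W")
```

On these numbers Blackwell delivers roughly 1.5× to 1.6× more dense FP8 per watt than H100 (4.17 to 4.5 versus 2.83 TFLOPS/W), so the move to ~1 kW parts raises efficiency as well as absolute power.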

References

  • NVIDIA HGX B200 OEM datasheet; nvidia.com/en-us/data-center/{h100,h200,gb200-nvl72}/
  • developer.nvidia.com/blog/nvidia-gb200-nvl72-delivers-trillion-parameter-llm-training-and-real-time-inference/
  • cloud.google.com/blog/products/compute/{introducing-trillium-6th-gen-tpus,ironwood-tpus-and-new-axion-based-vms-for-your-ai-workloads}
  • aws.amazon.com/ec2/instance-types/{trn2,trn3}/
  • SemiAnalysis: TPUv7 deep-dive; AWS Trainium3 deep-dive.

Further reading

→ Article 13 (Cluster scale) · Article 14 (5D parallelism) · Article 15 (FP8)

Series

LLM Pre-Training 2026


Side-by-side specifications of the main chip and pod systems used at the 2026 LLM frontier, with primary-source numbers from NVIDIA, Google Cloud, and AWS.