Frontier AI Training Hardware in 2026: H100, H200, GB200 NVL72, TPU Ironwood, Trainium2/3

May 22, 2026 · 732 words · 4 minutes reading time

TL;DR

NVIDIA ships the most-deployed training chips: H100 (700 W, 3.35 TB/s HBM3, 989 BF16 dense / 1,979 FP8 dense TFLOPS), H200 (141 GB HBM3e, 4.8 TB/s, same compute), and Blackwell B200 / GB200 NVL72 (per-GPU 5 PFLOPS FP8 dense / 10 PFLOPS sparse; rack 1.44 EFLOPS FP4 sparse).
Google's Ironwood (TPU v7) delivers 4,614 TFLOPS FP8 per chip, 192 GB HBM3e, and 42.5 EFLOPS FP8 per 9,216-chip pod, which puts it within about 8% of a GB200 GPU's 5 PFLOPS dense FP8 on a per-chip basis.
AWS Trainium2 delivers 1.3 PFLOPS dense FP8 per chip; a Trn2 UltraServer combines 64 chips for 83.2 PFLOPS and 6 TB HBM. Trainium3 (3 nm, 2026) roughly doubles per-chip compute to 2.52 PFLOPS FP8 (AWS quotes 2x) with 144 GB HBM3e.

Master comparison table (primary sources)

Chip	Dense FP8 (TFLOPS, unless noted)	HBM	HBM BW	Interconnect	TDP
H100 SXM5	1,979	80 GB HBM3	3.35 TB/s	NVLink 4: 900 GB/s	700 W
H200 SXM	1,979	141 GB HBM3e	4.8 TB/s	NVLink 4: 900 GB/s	700 W
B200 (HGX)	4,500 (dense)	180 GB HBM3e	7.7 TB/s	NVLink 5: 1.8 TB/s	~1,000 W
GB200 GPU	5,000 (dense)	186 GB HBM3e	8 TB/s	NVLink 5: 1.8 TB/s	up to 1,200 W
TPU v5p	(BF16) 459 TFLOPS	95 GB HBM2e	2.76 TB/s	ICI 4,800 Gbps/chip, 8,960-chip pod	—
TPU v6e Trillium	(BF16) ~918 TFLOPS	32 GB HBM	~1.6 TB/s	256-chip pod	—
TPU v7 Ironwood	4,614	192 GB HBM3e	7.37 TB/s	9,216-chip pod	~1 kW
Trainium2	1,300	96 GB HBM	~2.9 TB/s	NeuronLink, 64-chip UltraServer	—
Trainium3 (3 nm)	2,520 (MXFP8)	144 GB HBM3e	4.9 TB/s	NeuronLink v4, 144-chip UltraServer	—
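The Interconnect column mixes units (Gbps for ICI, GB/s and TB/s for NVLink). A minimal Python sketch for putting the listed figures on a common GB/s scale follows; the helper names are illustrative, and the sketch only normalizes units. It does not account for how each vendor counts links or directions, so the results are not a like-for-like benchmark.

```python
def gbps_to_gb_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


def tb_per_s_to_gb_per_s(tb_per_s: float) -> float:
    """Convert terabytes per second to gigabytes per second (decimal units)."""
    return tb_per_s * 1000


# Figures copied from the table above; keys are illustrative labels.
INTERCONNECT_GB_PER_S = {
    "TPU v5p ICI (4,800 Gbps/chip)": gbps_to_gb_per_s(4800),
    "NVLink 4 (H100/H200)": 900.0,
    "NVLink 5 (B200/GB200)": tb_per_s_to_gb_per_s(1.8),
}

if __name__ == "__main__":
    for name, value in INTERCONNECT_GB_PER_S.items():
        print(f"{name:32s} {value:8.1f} GB/s")
```

On these inputs the v5p ICI figure works out to 600 GB/s per chip, against 900 GB/s for NVLink 4 and 1,800 GB/s for NVLink 5.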

Pod / rack totals

System	Aggregate FP8 (dense)	Aggregate HBM	Aggregate bandwidth / interconnect
GB200 NVL72	72 × 5 PF = 360 PFLOPS dense (720 PF sparse)	~13.4 TB HBM3e	130 TB/s NVLink
TPU v7 Ironwood pod	~42.5 EFLOPS FP8	9,216 × 192 GB = ~1.77 PB HBM3e	3D-torus ICI
Trn2 UltraServer	64 × 1.3 = 83.2 PFLOPS FP8	6 TB HBM	185 TB/s aggregate HBM bandwidth (64 × 2.9 TB/s); NeuronLink scale-up
Trn3 UltraServer	144 × 2.52 = 362 PFLOPS MXFP8	20.7 TB HBM3e	706 TB/s aggregate HBM bandwidth (144 × 4.9 TB/s); NeuronLink v4 scale-up
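The totals above are straightforward products of chip count and per-chip figures. As a sanity check, here is a minimal Python sketch that recomputes them from the per-chip table. The `System` class and `summarize` helper are hypothetical names introduced for illustration, and decimal units (1 TB = 1,000 GB) are an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    chips: int
    fp8_pflops_per_chip: float  # dense FP8 (MXFP8 for Trainium3), per-chip table
    hbm_gb_per_chip: float
    hbm_tb_per_s_per_chip: float

    @property
    def fp8_pflops(self) -> float:
        return self.chips * self.fp8_pflops_per_chip

    @property
    def hbm_tb(self) -> float:
        # Assumes decimal units: 1 TB = 1,000 GB.
        return self.chips * self.hbm_gb_per_chip / 1000

    @property
    def hbm_bw_tb_per_s(self) -> float:
        return self.chips * self.hbm_tb_per_s_per_chip


# Inputs copied from the per-chip comparison table.
SYSTEMS = [
    System("GB200 NVL72", 72, 5.0, 186, 8.0),
    System("TPU v7 Ironwood pod", 9216, 4.614, 192, 7.37),
    System("Trn2 UltraServer", 64, 1.3, 96, 2.9),
    System("Trn3 UltraServer", 144, 2.52, 144, 4.9),
]


def summarize(systems=SYSTEMS):
    """Return {name: (FP8 PFLOPS, HBM TB, HBM bandwidth TB/s)}, rounded to 0.1."""
    return {
        s.name: (
            round(s.fp8_pflops, 1),
            round(s.hbm_tb, 1),
            round(s.hbm_bw_tb_per_s, 1),
        )
        for s in systems
    }


if __name__ == "__main__":
    for name, (pf, tb, bw) in summarize().items():
        print(f"{name:22s} {pf:>10,.1f} PFLOPS  {tb:>8,.1f} TB HBM  {bw:>9,.1f} TB/s HBM BW")
```

Under these assumptions the Ironwood pod comes out at about 42,523 PFLOPS (42.5 EFLOPS) and about 1,769 TB of HBM. The Trainium rows give roughly 185.6 TB/s and 705.6 TB/s of summed HBM bandwidth, which matches the 185 TB/s and 706 TB/s figures in the last column.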

Sources (verbatim quotes)

H200: "the NVIDIA H200 is the first GPU to offer 141 gigabytes (GB) of HBM3e memory at 4.8 terabytes per second (TB/s)" (nvidia.com/en-us/data-center/h200/).
GB200 NVL72: "72 NVIDIA Blackwell GPUs interconnected by the largest NVIDIA NVLink domain ever offered, NVLink Switch System provides 130 terabytes per second (TB/s) of low-latency GPU communications" (nvidia.com/en-us/data-center/gb200-nvl72/).
TPU v5p: "Each TPU v5p pod composes together 8,960 chips over our highest-bandwidth inter-chip interconnect (ICI) at 4,800 Gbps/chip in a 3D torus topology" (cloud.google.com).
TPU Ironwood: "TPU7x is the first release within the Ironwood family… With a 9,216-chip footprint per Pod… Each chip is equipped with 192 GB of HBM, with bandwidth of approximately 7.37 TB/s" (docs.cloud.google.com/tpu/docs/tpu7x).
Trainium2: "Trn2 instances feature 16 Trainium2 chips… up to 20.8 FP8 petaflops of compute. Trn2 UltraServers extend NeuronLink connectivity to 64 Trainium2 chips… up to 83.2 FP8 petaflops of compute" (aws.amazon.com/ec2/instance-types/trn2/).
Trainium3: "AWS Trainium3 chip provides 2x higher compute performance to 2.52 petaflops (PFLOPs) of FP8 compute, increases the memory capacity by 1.5x and bandwidth by 1.7x over Trainium2 to 144 GB of HBM3e memory, and 4.9 TB/s of memory bandwidth" (aws.amazon.com/ai/machine-learning/trainium/).
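The multipliers in the Trainium3 quote can be cross-checked against the Trainium2 figures used elsewhere in this article. A minimal Python sketch, assuming the per-chip values from the comparison table (the dictionary and function names are illustrative):

```python
# Per-chip figures from the comparison table (Trainium2 bandwidth is approximate).
TRN2 = {"compute": 1.3, "memory": 96, "bandwidth": 2.9}    # PFLOPS FP8, GB, TB/s
TRN3 = {"compute": 2.52, "memory": 144, "bandwidth": 4.9}  # PFLOPS FP8, GB, TB/s


def trn3_vs_trn2():
    """Return the Trainium3 / Trainium2 ratio for each listed metric."""
    return {key: TRN3[key] / TRN2[key] for key in TRN2}


if __name__ == "__main__":
    for metric, ratio in trn3_vs_trn2().items():
        print(f"{metric:10s} {ratio:.2f}x")
```

With these inputs the ratios land at about 1.94x compute, 1.5x memory, and 1.69x bandwidth, consistent with the rounded 2x, 1.5x, and 1.7x in the AWS quote.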

FAQ

Q: How does Ironwood compare to GB200? Per-chip FP8: Ironwood ~4.6 PFLOPS vs GB200 5 PFLOPS dense, so the two are within about 8% of each other. Pod scale: a 9,216-chip Ironwood pod delivers 42.5 EFLOPS vs 360 PFLOPS dense for a 72-GPU NVL72 rack. Google's scale-up unit is therefore roughly 118× larger than NVIDIA's by FP8 compute (128× by chip count).
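The comparison in this answer reduces to three ratios. A minimal Python sketch, using the per-chip and pod figures from the tables (the constant and function names are illustrative, not from any vendor tooling):

```python
# Figures copied from the tables above.
IRONWOOD_CHIP_PFLOPS = 4.614   # dense FP8 per chip
GB200_GPU_PFLOPS = 5.0         # dense FP8 per GPU
IRONWOOD_POD_CHIPS = 9216
NVL72_GPUS = 72


def scale_up_ratios():
    """Return (per-chip ratio, chip-count ratio, aggregate FP8 ratio), Ironwood over GB200."""
    per_chip = IRONWOOD_CHIP_PFLOPS / GB200_GPU_PFLOPS
    chip_count = IRONWOOD_POD_CHIPS / NVL72_GPUS
    aggregate = (IRONWOOD_POD_CHIPS * IRONWOOD_CHIP_PFLOPS) / (NVL72_GPUS * GB200_GPU_PFLOPS)
    return per_chip, chip_count, aggregate


if __name__ == "__main__":
    per_chip, chip_count, aggregate = scale_up_ratios()
    print(f"per-chip FP8:  {per_chip:.2f}x")
    print(f"chip count:    {chip_count:.0f}x")
    print(f"aggregate FP8: {aggregate:.0f}x")
```

This compares peak dense FP8 only; it says nothing about interconnect topology or achieved utilization, which differ between a 3D-torus ICI pod and an NVLink domain.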

Q: When does FP4 reach production? GB200 supports it in hardware; DeepSeek's V3 paper and the SemiAnalysis Trainium3 piece both point to FP4 as the next step.

Q: Why are chips now ~1 kW? More HBM stacks plus tensor cores running at higher clocks push up per-chip power, and that power density is what drives the gigawatt data-center wave.

References

NVIDIA HGX B200 OEM datasheet; nvidia.com/en-us/data-center/{h100,h200,gb200-nvl72}/
developer.nvidia.com/blog/nvidia-gb200-nvl72-delivers-trillion-parameter-llm-training-and-real-time-inference/
cloud.google.com/blog/products/compute/{introducing-trillium-6th-gen-tpus,ironwood-tpus-and-new-axion-based-vms-for-your-ai-workloads}
aws.amazon.com/ec2/instance-types/{trn2,trn3}/
SemiAnalysis: TPUv7 deep-dive; AWS Trainium3 deep-dive.
