Frontier AI Training Hardware in 2026: H100, H200, GB200 NVL72, TPU Ironwood, Trainium2/3
TL;DR
- NVIDIA ships the most-deployed training chips: H100 (700 W, 3.35 TB/s HBM3, 989 BF16 dense / 1,979 FP8 dense TFLOPS), H200 (141 GB HBM3e, 4.8 TB/s, same compute as H100), and Blackwell: B200 (4.5 PFLOPS FP8 dense) and GB200 NVL72 (per-GPU 5 PFLOPS FP8 dense / 10 PFLOPS FP8 sparse; rack 1.44 EFLOPS FP4 sparse).
- Google's Ironwood (TPU v7) delivers 4,614 TFLOPS FP8 per chip, 192 GB HBM3e, and 42.5 EFLOPS FP8 per 9,216-chip pod, which puts it within about 8% of GB200's 5 PFLOPS dense FP8 on a per-chip basis.
- AWS Trainium2 delivers 1.3 PFLOPS FP8 dense per chip; a Trn2 UltraServer combines 64 chips for 83.2 PFLOPS and 6 TB HBM. Trainium3 (3 nm, 2026) roughly doubles per-chip compute to 2.52 PFLOPS FP8 and carries 144 GB HBM3e.
Master comparison table (primary sources)
| Chip | Dense FP8 TFLOPS (BF16 where noted) | HBM | HBM BW | Interconnect | TDP |
|---|---|---|---|---|---|
| H100 SXM5 | 1,979 | 80 GB HBM3 | 3.35 TB/s | NVLink 4: 900 GB/s | 700 W |
| H200 SXM | 1,979 | 141 GB HBM3e | 4.8 TB/s | NVLink 4: 900 GB/s | 700 W |
| B200 (HGX) | 4,500 (dense) | 180 GB HBM3e | 7.7 TB/s | NVLink 5: 1.8 TB/s | ~1,000 W |
| GB200 GPU | 5,000 (dense) | 186 GB HBM3e | 8 TB/s | NVLink 5: 1.8 TB/s | up to 1,200 W |
| TPU v5p | 459 (BF16) | 95 GB HBM2e | 2.76 TB/s | ICI 4,800 Gbps/chip, 8,960-chip pod | — |
| TPU v6e Trillium | ~918 (BF16) | 32 GB HBM | ~1.6 TB/s | 256-chip pod | — |
| TPU v7 Ironwood | 4,614 | 192 GB HBM3e | 7.37 TB/s | 9,216-chip pod | ~1 kW |
| Trainium2 | 1,300 | 96 GB HBM | ~2.9 TB/s | NeuronLink, 64-chip UltraServer | — |
| Trainium3 (3 nm) | 2,520 (MXFP8) | 144 GB HBM3e | 4.9 TB/s | NeuronLink v4, 144-chip UltraServer | — |
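One way to read the table is as a compute-to-bandwidth ratio: dense FP8 TFLOPS divided by HBM bandwidth in TB/s gives the peak FLOPs available per byte of HBM traffic. The sketch below computes that ratio for the rows that list an FP8 figure (TPU v5p and v6e are skipped because the table gives BF16 for them). The names `CHIPS` and `flops_per_byte` are illustrative, and the ratio is a derived quantity, not a vendor-published number.

```python
# Per-chip figures copied from the comparison table above:
# (dense FP8 TFLOPS, HBM bandwidth in TB/s).
CHIPS = {
    "H100 SXM5": (1979, 3.35),
    "H200 SXM": (1979, 4.8),
    "B200 (HGX)": (4500, 7.7),
    "GB200 GPU": (5000, 8.0),
    "TPU v7 Ironwood": (4614, 7.37),
    "Trainium2": (1300, 2.9),
    "Trainium3": (2520, 4.9),
}


def flops_per_byte(tflops: float, hbm_tb_per_s: float) -> float:
    """Peak FLOPs per byte read from HBM (TFLOPS divided by TB/s)."""
    return tflops / hbm_tb_per_s


balance = {name: flops_per_byte(*spec) for name, spec in CHIPS.items()}

# Print chips from lowest to highest compute-to-bandwidth ratio.
for name, ratio in sorted(balance.items(), key=lambda kv: kv[1]):
    print(f"{name:<16} {ratio:6.0f} FLOPs/byte")
```

On these figures H200 has the lowest ratio (about 412 FLOPs/byte), since it pairs H100's compute with more bandwidth, while GB200 and Ironwood both land near 625.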
Pod / rack totals
| System | Aggregate FP8 (dense) | Aggregate HBM | Aggregate interconnect / HBM bandwidth |
|---|---|---|---|
| GB200 NVL72 | 72 × 5 PF = 360 PFLOPS dense (720 PF sparse) | ~13.4 TB HBM3e | 130 TB/s NVLink |
| TPU v7 Ironwood pod | 9,216 × 4,614 TF = ~42.5 EFLOPS FP8 | 9,216 × 192 GB = ~1.77 PB HBM3e | 3D-torus ICI |
| Trn2 UltraServer | 64 × 1.3 PF = 83.2 PFLOPS FP8 | 6 TB HBM | 185 TB/s HBM bandwidth (64 × ~2.9 TB/s) |
| Trn3 UltraServer | 144 × 2.52 PF = ~362 PFLOPS MXFP8 | 20.7 TB HBM3e | 706 TB/s HBM bandwidth (144 × 4.9 TB/s) |
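The aggregate columns are simple products of chip count and per-chip figures. The following sketch recomputes them from the per-chip values in the master comparison table, assuming decimal units (1 TB = 1,000 GB); `SYSTEMS` and `aggregate` are illustrative names.

```python
# (chips per system, dense FP8 PFLOPS per chip, HBM GB per chip),
# all taken from the per-chip comparison table.
SYSTEMS = {
    "GB200 NVL72": (72, 5.0, 186),
    "TPU v7 Ironwood pod": (9216, 4.614, 192),
    "Trn2 UltraServer": (64, 1.3, 96),
    "Trn3 UltraServer": (144, 2.52, 144),
}


def aggregate(chips: int, pflops_per_chip: float, hbm_gb_per_chip: float):
    """Return (total dense FP8 PFLOPS, total HBM in decimal TB)."""
    total_pflops = chips * pflops_per_chip
    total_hbm_tb = chips * hbm_gb_per_chip / 1000
    return total_pflops, total_hbm_tb


totals = {name: aggregate(*spec) for name, spec in SYSTEMS.items()}

for name, (pflops, hbm_tb) in totals.items():
    print(f"{name:<20} {pflops:10,.1f} PFLOPS  {hbm_tb:8,.1f} TB HBM")
```

The Trn3 product is 362.88 PFLOPS, which AWS quotes as 362, and the Ironwood HBM product comes out at about 1.77 PB.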
Sources (verbatim quotes)
- H200: "the NVIDIA H200 is the first GPU to offer 141 gigabytes (GB) of HBM3e memory at 4.8 terabytes per second (TB/s)" (nvidia.com/en-us/data-center/h200/).
- GB200 NVL72: "72 NVIDIA Blackwell GPUs interconnected by the largest NVIDIA NVLink domain ever offered, NVLink Switch System provides 130 terabytes per second (TB/s) of low-latency GPU communications" (nvidia.com/en-us/data-center/gb200-nvl72/).
- TPU v5p: "Each TPU v5p pod composes together 8,960 chips over our highest-bandwidth inter-chip interconnect (ICI) at 4,800 Gbps/chip in a 3D torus topology" (cloud.google.com).
- TPU Ironwood: "TPU7x is the first release within the Ironwood family… With a 9,216-chip footprint per Pod… Each chip is equipped with 192 GB of HBM, with bandwidth of approximately 7.37 TB/s" (docs.cloud.google.com/tpu/docs/tpu7x).
- Trainium2: "Trn2 instances feature 16 Trainium2 chips… up to 20.8 FP8 petaflops of compute. Trn2 UltraServers extend NeuronLink connectivity to 64 Trainium2 chips… up to 83.2 FP8 petaflops of compute" (aws.amazon.com/ec2/instance-types/trn2/).
- Trainium3: "AWS Trainium3 chip provides 2x higher compute performance to 2.52 petaflops (PFLOPs) of FP8 compute, increases the memory capacity by 1.5x and bandwidth by 1.7x over Trainium2 to 144 GB of HBM3e memory, and 4.9 TB/s of memory bandwidth" (aws.amazon.com/ai/machine-learning/trainium/).
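The Trainium3 quote also gives generation-over-generation multipliers, which offers a cross-check on the Trainium2 row. The sketch below divides the quoted Trainium3 figures by those multipliers; `TRN3`, `MULTIPLIER`, and `implied_trn2` are illustrative names, and the multipliers are rounded marketing figures, so only approximate agreement should be expected.

```python
# Trainium3 figures and multipliers from the AWS quote above.
TRN3 = {"fp8_pflops": 2.52, "hbm_gb": 144, "hbm_tb_s": 4.9}
MULTIPLIER = {"fp8_pflops": 2.0, "hbm_gb": 1.5, "hbm_tb_s": 1.7}

# Trainium2 values implied by dividing out each multiplier.
implied_trn2 = {key: TRN3[key] / MULTIPLIER[key] for key in TRN3}

for key, value in implied_trn2.items():
    print(f"{key:<11} {value:.2f}")
```

The implied memory figures (96 GB, about 2.9 TB/s) match the Trainium2 row. The implied compute is 1.26 PFLOPS against the 1.3 PFLOPS that follows from 83.2 / 64, so the "2x" in the quote is rounded up from roughly 1.94x.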
FAQ
Q: How does Ironwood compare to GB200? Per-chip FP8: Ironwood ~4.6 PF vs GB200 5 PF dense, a gap of about 8%. Pod scale: 9,216-chip Ironwood pod = 42.5 EFLOPS vs 72-chip NVL72 rack = 360 PFLOPS dense. Google's scale-up unit is now 128× larger than NVIDIA's by chip count and roughly 118× larger by FP8 compute.
Q: When does FP4 reach production? GB200 already supports it; DeepSeek's V3 paper and the SemiAnalysis Trainium3 piece both note FP4 as the next step.
Q: Why ~1 kW chips? More HBM stacks plus tensor cores running at higher clocks push per-chip power toward 1 kW, and that power density is what drives the gigawatt data-center wave.
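The Ironwood vs GB200 comparison in the FAQ reduces to three ratios. A quick check, using only figures quoted in this article (the constant names are illustrative):

```python
# Per-chip dense FP8 PFLOPS and scale-up domain sizes quoted in this article.
IRONWOOD_CHIP_PF = 4.614
GB200_CHIP_PF = 5.0
IRONWOOD_POD_CHIPS = 9216
NVL72_CHIPS = 72

# Fraction by which Ironwood trails GB200 per chip.
per_chip_gap = 1 - IRONWOOD_CHIP_PF / GB200_CHIP_PF

# Aggregate dense FP8 for each scale-up unit, in PFLOPS.
pod_pf = IRONWOOD_POD_CHIPS * IRONWOOD_CHIP_PF
rack_pf = NVL72_CHIPS * GB200_CHIP_PF

compute_ratio = pod_pf / rack_pf
chip_ratio = IRONWOOD_POD_CHIPS / NVL72_CHIPS

print(f"per-chip gap:  {per_chip_gap:.1%}")
print(f"compute ratio: {compute_ratio:.0f}x")
print(f"chip ratio:    {chip_ratio:.0f}x")
```

This compares an ICI pod against a single NVLink rack, so it measures the size of each vendor's scale-up domain, not the size of a full training cluster.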
References
- NVIDIA HGX B200 OEM datasheet; nvidia.com/en-us/data-center/{h100,h200,gb200-nvl72}/
- developer.nvidia.com/blog/nvidia-gb200-nvl72-delivers-trillion-parameter-llm-training-and-real-time-inference/
- cloud.google.com/blog/products/compute/{introducing-trillium-6th-gen-tpus,ironwood-tpus-and-new-axion-based-vms-for-your-ai-workloads}
- aws.amazon.com/ec2/instance-types/{trn2,trn3}/
- SemiAnalysis: TPUv7 deep-dive; AWS Trainium3 deep-dive.
Further reading
→ Article 13 (Cluster scale) · Article 14 (5D parallelism) · Article 15 (FP8)