Cluster Scale 2026: Colossus, Stargate, Project Rainier

May 20, 2026 · 528 words · 3 minutes reading time

Cluster Scale 2026

TL;DR

xAI Colossus 1: ~200,000 GPUs (150K H100 + 50K H200) plus ~30K GB200, built in a former Memphis Electrolux plant; first 100K phase deployed in 122 days, doubled in another 92.
AWS Project Rainier: ~500,000 Trainium2 chips across multiple US data centers (Indiana flagship), scaling to >1M; Anthropic exclusive tenant; ~5× the compute Anthropic used for prior Claude generations.
OpenAI Stargate: a $500B, 10-GW commitment over four years; ~7 GW under contract by end-2025 (Abilene flagship live, Texas / NM / Ohio / WI sites; Stargate UAE).

Key facts

Cluster	Chips	Type	Power	Tenant	Status
Colossus 1	200K + 30K GB200	NVIDIA H100/H200 + GB200	~250 MW + gas turbines	xAI (Grok)	Live
Colossus 2 (Mississippi)	target 550K+ Blackwell	NVIDIA GB200/GB300	gigawatt-scale	xAI	Construction 2026
Project Rainier	~500K Trainium2 (→1M)	AWS Trainium2 → Trainium3	multi-site	Anthropic	Live, expanding
Stargate Abilene	NVIDIA GB200 racks	NVIDIA	~1 GW potential	OpenAI	Live
Stargate total	"2M chips" plan	NVIDIA + custom	10 GW by 2029	OpenAI	$500B commitment

How did xAI build 200K GPUs in 214 days?

xAI took an abandoned 785,000-sq-ft Electrolux plant, brought in 14 mobile gas turbines for power, and used Supermicro liquid-cooled racks (64 GPUs/rack, roughly 1,500 racks for the 100K phase). Networking: NVIDIA Spectrum-X Ethernet (not InfiniBand) with BlueField-3 SuperNICs at 400 Gb/s per GPU and SN5600 (51.2 Tb/s) top-of-rack switches.
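The figures in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted above; the function names are illustrative, and the "ports" figure is simply aggregate switch bandwidth divided by per-GPU NIC speed, not a statement about how the switches are actually cabled.

```python
import math

# Figures quoted in the paragraph above; nothing here is measured.
GPUS_PHASE_1 = 100_000      # first deployment phase
GPUS_PER_RACK = 64          # Supermicro liquid-cooled rack
DAYS_PHASE_1 = 122          # first 100K GPUs
DAYS_PHASE_2 = 92           # doubling to 200K
NIC_GBPS_PER_GPU = 400      # BlueField-3 SuperNIC
SWITCH_TBPS = 51.2          # SN5600 aggregate bandwidth


def racks_needed(gpus: int, per_rack: int) -> int:
    """Smallest whole number of racks that holds `gpus`."""
    return math.ceil(gpus / per_rack)


def ports_at_line_rate(switch_tbps: float, port_gbps: int) -> int:
    """How many ports of `port_gbps` add up to the switch's aggregate bandwidth."""
    return round(switch_tbps * 1000 / port_gbps)


racks = racks_needed(GPUS_PHASE_1, GPUS_PER_RACK)
total_days = DAYS_PHASE_1 + DAYS_PHASE_2
ports = ports_at_line_rate(SWITCH_TBPS, NIC_GBPS_PER_GPU)
injection_tbps = GPUS_PHASE_1 * NIC_GBPS_PER_GPU / 1000

print(f"racks for 100K GPUs at 64/rack: {racks}")
print(f"days to 200K GPUs: {total_days}")
print(f"400G ports per 51.2 Tb/s switch: {ports}")
print(f"aggregate NIC bandwidth, 100K GPUs: {injection_tbps:,.0f} Tb/s")
```

At exactly 64 GPUs per rack, 100,000 GPUs need 1,563 racks, so the quoted 1,500 racks (96,000 GPUs) reads as a rounded figure.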

What is Project Rainier actually?

AWS describes Rainier as an "EC2 UltraCluster of Trainium2 UltraServers." Each UltraServer = 4 servers × 16 chips = 64 Trainium2 chips connected by NeuronLink (blue cables); UltraServers tile via Elastic Fabric Adapter (yellow cables). Anthropic: "we currently use over one million Trainium2 chips to train and serve Claude."
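To get a feel for the building-block count, here is a minimal sketch that converts chip totals into UltraServers using the 4 × 16 layout described above. The chip totals are the ~500K and 1M figures quoted in this article; the helper name is hypothetical.

```python
# UltraServer layout as described above: 4 servers x 16 Trainium2 chips.
SERVERS_PER_ULTRASERVER = 4
CHIPS_PER_SERVER = 16
CHIPS_PER_ULTRASERVER = SERVERS_PER_ULTRASERVER * CHIPS_PER_SERVER


def ultraservers_for(chips: int) -> int:
    """Whole UltraServers needed to hold `chips` (ceiling division)."""
    return -(-chips // CHIPS_PER_ULTRASERVER)


for chips in (500_000, 1_000_000):
    print(f"{chips:>9,} chips -> {ultraservers_for(chips):,} UltraServers")
```

So the ~500K-chip deployment corresponds to roughly 7,800 UltraServers, and 1M chips to about 15,600, all stitched together over EFA.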

What is Stargate, concretely?

Stargate is OpenAI's umbrella for all future infrastructure (not a single facility). Sites announced through Q1 2026:

Site	Status	GW
Abilene, TX (flagship)	Live (GB200 racks since Jun 2025)	up to ~1
Shackelford County, TX	Construction	1.4
Doña Ana County, NM	Construction	—
Lordstown, OH (SoftBank)	Construction	—
Milam County, TX (SB Energy)	2026	1.2
Stargate Michigan ("The Barn")	2026 construction	1.0
Stargate UAE	2026 opening	—
Stargate Norway / UK / Argentina	Planned	—

Plus contracts: $300B Oracle (5 years), $100B NVIDIA equity-for-compute, $90B AMD warrant deal, $250B Microsoft Azure, $350B Broadcom custom silicon, $50B AWS Trainium.
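The site table can be tallied against the 10 GW target. The sketch below sums only the rows that state a capacity, treating Abilene's "up to ~1" as 1.0 GW (an assumption), so the result is a floor for the listed sites rather than a total.

```python
# Rows from the site table above; None marks a capacity the table does not state.
# Abilene's "up to ~1" is taken as 1.0 GW (assumption).
SITES_GW = {
    "Abilene, TX": 1.0,
    "Shackelford County, TX": 1.4,
    "Doña Ana County, NM": None,
    "Lordstown, OH": None,
    "Milam County, TX": 1.2,
    "Stargate Michigan": 1.0,
    "Stargate UAE": None,
    "Stargate Norway / UK / Argentina": None,
}
TARGET_GW = 10.0  # Stargate commitment quoted in the TL;DR

known_gw = round(sum(gw for gw in SITES_GW.values() if gw is not None), 1)
unknown = [name for name, gw in SITES_GW.items() if gw is None]

print(f"stated capacity: {known_gw} GW ({known_gw / TARGET_GW:.0%} of target)")
print(f"rows without a stated capacity: {len(unknown)}")
```

The four rows with figures account for 4.6 GW, a little under half of the 10 GW commitment; the remaining rows carry no capacity in the table.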

FAQ

Q: Are these single-fabric clusters or "logical" clusters? Colossus 1 is split across four 25,000-GPU halls; analysts believe the halls were not designed to run as a single High-Performance Linpack (HPL) system. Rainier spans multiple data centers, with EFA linking the sites. Stargate Abilene is single-site.

Q: Power per GPU? H100 ≈ 700 W, GB200 up to 1,200 W. A 100K-GPU GB200 system draws ~120 MW for the chips alone, roughly doubling once cooling and networking are included.
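The power estimate is a straight multiplication, sketched below. The per-GPU wattages are the ones quoted in this answer; the 2.0 overhead factor encodes the "doubling with cooling and networking" rule of thumb and is a rough assumption, not a measured PUE.

```python
# Per-GPU power figures quoted above (watts).
WATTS_PER_GPU = {"H100": 700, "GB200": 1200}


def chip_power_mw(num_gpus: int, watts_per_gpu: float) -> float:
    """Power drawn by the accelerators alone, in megawatts."""
    return num_gpus * watts_per_gpu / 1e6


def facility_power_mw(num_gpus: int, watts_per_gpu: float,
                      overhead: float = 2.0) -> float:
    """Chip power scaled by an assumed overhead for cooling and networking."""
    return chip_power_mw(num_gpus, watts_per_gpu) * overhead


for name, watts in WATTS_PER_GPU.items():
    chips = chip_power_mw(100_000, watts)
    total = facility_power_mw(100_000, watts)
    print(f"100K {name}: {chips:.0f} MW chips, ~{total:.0f} MW with overhead")
```

For 100K GB200 GPUs this gives 120 MW for the chips and about 240 MW all-in; the same count of H100s comes to 70 MW and about 140 MW.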

Q: Are 1M GPUs feasible? xAI's Colossus 2 roadmap, OpenAI's 10 GW Stargate, and Anthropic's "≥5 GW" AWS expansion together imply ~$725B in aggregate 2026 hyperscaler capex.

References

xAI Colossus: x.ai/colossus; SemiAnalysis "Colossus 2"; en.wikipedia.org/wiki/Colossus_(supercomputer)
AWS Rainier: aboutamazon.com/news/aws/aws-project-rainier-ai-trainium-chips-compute-cluster; anthropic.com/news/anthropic-amazon-compute
Stargate: openai.com/index/announcing-the-stargate-project/; openai.com/index/five-new-stargate-sites/
