Cluster Scale 2026: Colossus, Stargate, Project Rainier
TL;DR
- xAI Colossus 1: ~200,000 GPUs (150K H100 + 50K H200) plus ~30K GB200, built in a former Electrolux plant in Memphis; the first 100K phase was deployed in 122 days and doubled in another 92.
- AWS Project Rainier: ~500,000 Trainium2 chips across multiple US data centers (Indiana flagship), scaling to >1M; Anthropic is the exclusive tenant, with ~5× the compute it used for prior Claude generations.
- OpenAI Stargate: a $500B, 10-GW commitment over four years; ~7 GW under contract by end-2025 (Abilene flagship live, Texas / NM / Ohio / WI sites; Stargate UAE).
Key facts
| Cluster | Chips | Type | Power | Tenant | Status |
|---|---|---|---|---|---|
| Colossus 1 | ~200K + ~30K GB200 | NVIDIA H100/H200 + GB200 | ~250 MW + gas turbines | xAI (Grok) | Live |
| Colossus 2 (Mississippi) | target 550K+ Blackwell | NVIDIA GB200/GB300 | gigawatt-scale | xAI | Construction 2026 |
| Project Rainier | ~500K Trainium2 (→1M) | AWS Trainium2 → Trainium3 | multi-site | Anthropic | Live, expanding |
| Stargate Abilene | NVIDIA GB200 racks | NVIDIA | ~1 GW potential | OpenAI | Live |
| Stargate total | "2M chips" plan | NVIDIA + custom | 10 GW by 2029 | OpenAI | $500B commitment |
How did xAI build 200K GPUs in 214 days?
xAI took over an abandoned 785,000-sq-ft Electrolux plant, brought in 14 mobile gas turbines for power, and installed Supermicro liquid-cooled racks (64 GPUs per rack, 1,500 racks for the 100K phase). The 214 days is the sum of the two build phases (122 + 92). Networking is NVIDIA Spectrum-X Ethernet (not InfiniBand), with BlueField-3 SuperNICs at 400 Gb/s per GPU and SN5600 (51.2 Tb/s) top-of-rack switches.
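The rack and switch figures above can be sanity-checked with a little arithmetic. The sketch below uses only the numbers quoted in this section; the function names are illustrative, and the port count assumes every SN5600 port runs at the same 400 Gb/s as a GPU NIC, which is an assumption rather than a statement about xAI's actual topology.

```python
# Back-of-the-envelope check of the Colossus figures quoted in this section.
# All constants come from the text; function names are hypothetical.

GPUS_PER_RACK = 64          # Supermicro liquid-cooled rack
RACKS_FIRST_PHASE = 1_500   # racks in the 100K phase
NIC_GBPS_PER_GPU = 400      # BlueField-3 SuperNIC bandwidth per GPU
SWITCH_TBPS = 51.2          # SN5600 switching capacity
PHASE_DAYS = (122, 92)      # first 100K phase, then the doubling


def rack_gpu_count(racks: int = RACKS_FIRST_PHASE,
                   gpus_per_rack: int = GPUS_PER_RACK) -> int:
    """GPUs implied by the rack count (slightly under the rounded 100K)."""
    return racks * gpus_per_rack


def switch_ports_at_nic_speed(switch_tbps: float = SWITCH_TBPS,
                              nic_gbps: int = NIC_GBPS_PER_GPU) -> int:
    """Ports per switch if every port runs at the NIC speed (assumption)."""
    return int(round(switch_tbps * 1000)) // nic_gbps


def injection_bandwidth_tbps(gpus: int,
                             nic_gbps: int = NIC_GBPS_PER_GPU) -> float:
    """Aggregate NIC bandwidth across all GPUs, in Tb/s."""
    return gpus * nic_gbps / 1000


if __name__ == "__main__":
    gpus = rack_gpu_count()
    print("GPUs in first phase:", gpus)
    print("400G ports per SN5600:", switch_ports_at_nic_speed())
    print("Aggregate injection bandwidth (Tb/s):", injection_bandwidth_tbps(gpus))
    print("Total build days:", sum(PHASE_DAYS))
```

The rack math gives 96,000 GPUs, so "100K" is a rounded figure, and 122 + 92 reproduces the 214 days in the heading.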
What is Project Rainier actually?
AWS describes Rainier as an "EC2 UltraCluster of Trainium2 UltraServers." Each UltraServer = 4 servers × 16 chips = 64 Trainium2 chips connected by NeuronLink (blue cables); UltraServers tile via Elastic Fabric Adapter (yellow cables). Anthropic: "we currently use over one million Trainium2 chips to train and serve Claude."
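To make the UltraServer tiling concrete, here is a minimal sketch of how many 64-chip UltraServers the quoted chip counts would imply. It assumes every chip sits in a fully populated UltraServer, which the source does not state; the function names are illustrative.

```python
import math

# UltraServer composition as described above.
SERVERS_PER_ULTRASERVER = 4
CHIPS_PER_SERVER = 16


def chips_per_ultraserver() -> int:
    """Trainium2 chips in one NeuronLink-connected UltraServer."""
    return SERVERS_PER_ULTRASERVER * CHIPS_PER_SERVER


def ultraservers_needed(total_chips: int) -> int:
    """UltraServers required to host total_chips, rounding up.

    Assumes all chips are deployed in full UltraServers (an assumption).
    """
    return math.ceil(total_chips / chips_per_ultraserver())


if __name__ == "__main__":
    for chips in (500_000, 1_000_000):
        print(chips, "chips ->", ultraservers_needed(chips), "UltraServers")
```

Under that assumption, ~500K chips corresponds to roughly 7,800 UltraServers joined over EFA, and 1M chips to about twice that.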
What is Stargate, concretely?
Stargate is OpenAI's umbrella for all future infrastructure (not a single facility). Sites announced through Q1 2026:
| Site | Status | GW |
|---|---|---|
| Abilene, TX (flagship) | Live (GB200 racks since Jun 2025) | up to ~1 |
| Shackelford County, TX | Construction | 1.4 |
| Doña Ana County, NM | Construction | — |
| Lordstown, OH (SoftBank) | Construction | — |
| Milam County, TX (SB Energy) | 2026 | 1.2 |
| Stargate Michigan ("The Barn") | 2026 construction | 1.0 |
| Stargate UAE | 2026 opening | — |
| Stargate Norway / UK / Argentina | Planned | — |
Associated contracts: $300B Oracle (5 years), $100B Nvidia equity-for-compute, $90B AMD warrant deal, $250B Microsoft Azure, $350B Broadcom custom silicon, and $50B AWS Trainium.
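As a rough illustration, the per-site figures in the table can be tallied against the 10 GW target. This sketch counts only the sites with a stated GW value, treats Abilene's "up to ~1" as 1.0, and marks undisclosed sites as `None`. It therefore understates the total (the TL;DR cites ~7 GW under contract) and should be read as a bookkeeping aid, not a capacity estimate.

```python
from typing import Dict, Optional

TARGET_GW = 10.0  # Stargate commitment quoted in this article

# GW per site as listed in the table; None means no figure was given.
SITES: Dict[str, Optional[float]] = {
    "Abilene, TX": 1.0,              # "up to ~1", taken as an upper bound
    "Shackelford County, TX": 1.4,
    "Doña Ana County, NM": None,
    "Lordstown, OH": None,
    "Milam County, TX": 1.2,
    "Stargate Michigan": 1.0,
    "Stargate UAE": None,
    "Stargate Norway / UK / Argentina": None,
}


def disclosed_gw(sites: Dict[str, Optional[float]]) -> float:
    """Sum of GW for sites with a stated figure, rounded to 0.1 GW."""
    return round(sum(gw for gw in sites.values() if gw is not None), 1)


def undisclosed_sites(sites: Dict[str, Optional[float]]) -> list:
    """Names of table rows with no GW figure."""
    return [name for name, gw in sites.items() if gw is None]


if __name__ == "__main__":
    total = disclosed_gw(SITES)
    print("Disclosed in table (GW):", total)
    print("Gap to 10 GW target (GW):", round(TARGET_GW - total, 1))
    print("Rows without a figure:", len(undisclosed_sites(SITES)))
```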
FAQ
Q: Are these single-fabric clusters or "logical" clusters? Colossus 1 is split across four 25,000-GPU halls; analysts believe they were not designed as one HPL system. Rainier is multi-data-center with EFA cross-site. Stargate Abilene is single-site.
Q: Power per GPU? H100 ≈ 700 W; GB200 up to 1,200 W. A 100K-GB200 system draws ~120 MW for the chips alone (100,000 × 1,200 W), roughly doubling once cooling and networking are included.
Q: Are 1M GPUs feasible? The roadmaps point that way: xAI's Colossus 2, OpenAI's 10 GW Stargate, and Anthropic's "≥5 GW" AWS expansion together imply ~$725B in aggregate 2026 hyperscaler capex.
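The power-per-GPU answer above reduces to a one-line calculation. The sketch below reproduces it using the per-chip wattages from the FAQ; the 2× overhead factor is the article's "doubling with cooling + networking" rule of thumb, not a measured PUE, and the function names are illustrative.

```python
H100_WATTS = 700      # approximate per-GPU power from the FAQ
GB200_WATTS = 1_200   # upper-bound per-GPU power from the FAQ


def chip_power_mw(n_chips: int, watts_per_chip: float) -> float:
    """Power drawn by the accelerators alone, in megawatts."""
    return n_chips * watts_per_chip / 1e6


def facility_power_mw(n_chips: int, watts_per_chip: float,
                      overhead_factor: float = 2.0) -> float:
    """Chip power scaled by an overhead factor for cooling and networking.

    The default of 2.0 mirrors the article's rule of thumb (assumption).
    """
    return chip_power_mw(n_chips, watts_per_chip) * overhead_factor


if __name__ == "__main__":
    n = 100_000
    print("GB200 chips only (MW):", chip_power_mw(n, GB200_WATTS))
    print("GB200 with overhead (MW):", facility_power_mw(n, GB200_WATTS))
    print("H100 chips only (MW):", chip_power_mw(n, H100_WATTS))
```

At these rates, a 1M-GB200 deployment would need on the order of 1.2 GW for chips alone, which is why the roadmaps above are quoted in gigawatts.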
References
- xAI Colossus: x.ai/colossus; SemiAnalysis "Colossus 2"; en.wikipedia.org/wiki/Colossus_(supercomputer)
- AWS Rainier: aboutamazon.com/news/aws/aws-project-rainier-ai-trainium-chips-compute-cluster; anthropic.com/news/anthropic-amazon-compute
- Stargate: openai.com/index/announcing-the-stargate-project/; openai.com/index/five-new-stargate-sites/
Further reading
→ Article 12 · Article 16 (reliability) · Article 24 (energy & capex)