Cluster Scale 2026: Colossus, Stargate, Project Rainier

TL;DR

  • xAI Colossus 1: ~200,000 GPUs (150K H100 + 50K H200) plus ~30K GB200, built in a former Memphis Electrolux plant; first 100K phase deployed in 122 days, doubled in another 92.
  • AWS Project Rainier: ~500,000 Trainium2 chips across multiple US data centers (Indiana flagship), scaling to >1M; Anthropic exclusive tenant; ~5× the compute Anthropic used for prior Claude generations.
  • OpenAI Stargate: a $500B, 10-GW commitment over four years; ~7 GW under contract by end-2025 (Abilene flagship live, Texas / NM / Ohio / WI sites; Stargate UAE).

Key facts

Cluster | Chips | Type | Power | Tenant | Status
Colossus 1 | 200K + 30K GB200 | NVIDIA H100/H200 + GB200 | ~250 MW + gas turbines | xAI (Grok) | Live
Colossus 2 (Mississippi) | target 550K+ Blackwell | NVIDIA GB200/GB300 | gigawatt-scale | xAI | Construction 2026
Project Rainier | ~500K Trainium2 (→1M) | AWS Trainium2 → Trainium3 | multi-site | Anthropic | Live, expanding
Stargate Abilene | NVIDIA GB200 racks | NVIDIA | ~1 GW potential | OpenAI | Live
Stargate total | "2M chips" plan | NVIDIA + custom | 10 GW by 2029 | OpenAI | $500B commitment

How did xAI build 200K GPUs in 214 days?

xAI took over an abandoned 785,000-sq-ft Electrolux plant, brought in 14 mobile gas turbines for power, and installed Supermicro liquid-cooled racks (64 GPUs per rack; 1,500 racks for the 100K phase). The network is NVIDIA Spectrum-X Ethernet rather than InfiniBand, with BlueField-3 SuperNICs at 400 Gb/s per GPU and SN5600 (51.2 Tb/s) top-of-rack switches.
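The rack and network figures above can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not xAI's actual topology: it assumes every switch port runs at the quoted 400 Gb/s with no oversubscription, and the function names are illustrative.

```python
# Figures quoted in the paragraph above.
GPUS_PER_RACK = 64
RACKS_100K_PHASE = 1_500
NIC_GBPS_PER_GPU = 400      # BlueField-3 SuperNIC rate, Gb/s per GPU
SWITCH_GBPS = 51_200        # SN5600 capacity: 51.2 Tb/s expressed in Gb/s


def gpus_in_phase(racks: int, gpus_per_rack: int) -> int:
    """Total GPUs housed in a given number of racks."""
    return racks * gpus_per_rack


def ports_per_switch(switch_gbps: int, port_gbps: int) -> int:
    """Ports one switch can serve at full rate.

    Assumption: all ports run at port_gbps, with no oversubscription.
    """
    return switch_gbps // port_gbps


def injection_bandwidth_tbps(gpus: int, nic_gbps: int) -> float:
    """Aggregate NIC bandwidth across all GPUs, in Tb/s."""
    return gpus * nic_gbps / 1_000


if __name__ == "__main__":
    gpus = gpus_in_phase(RACKS_100K_PHASE, GPUS_PER_RACK)
    print(f"GPUs in 1,500 racks: {gpus:,}")
    print(f"400G ports per SN5600: {ports_per_switch(SWITCH_GBPS, NIC_GBPS_PER_GPU)}")
    print(f"Aggregate injection bandwidth: {injection_bandwidth_tbps(gpus, NIC_GBPS_PER_GPU):,.0f} Tb/s")
```

At 64 GPUs per rack, 1,500 racks hold 96,000 GPUs, so the "100K phase" is a rounded figure or includes racks beyond the 1,500 quoted.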

What is Project Rainier actually?

AWS describes Rainier as an "EC2 UltraCluster of Trainium2 UltraServers." Each UltraServer = 4 servers × 16 chips = 64 Trainium2 chips connected by NeuronLink (blue cables); UltraServers tile via Elastic Fabric Adapter (yellow cables). Anthropic: "we currently use over one million Trainium2 chips to train and serve Claude."
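To make the UltraServer hierarchy concrete, here is a minimal sketch that counts how many UltraServers a given chip total implies. It uses only the 4 × 16 layout quoted above; the helper names are hypothetical, and real deployments may include spare or partially populated units.

```python
import math

# Layout quoted above: 4 servers per UltraServer, 16 Trainium2 chips per server.
SERVERS_PER_ULTRASERVER = 4
CHIPS_PER_SERVER = 16


def chips_per_ultraserver() -> int:
    """Chips inside one NeuronLink domain (one UltraServer)."""
    return SERVERS_PER_ULTRASERVER * CHIPS_PER_SERVER


def ultraservers_needed(total_chips: int) -> int:
    """Minimum number of UltraServers needed to hold total_chips."""
    return math.ceil(total_chips / chips_per_ultraserver())


if __name__ == "__main__":
    for chips in (500_000, 1_000_000):
        print(f"{chips:,} chips -> {ultraservers_needed(chips):,} UltraServers")
```

On these numbers, ~500K chips corresponds to roughly 7,800 UltraServers tiled over EFA, and 1M chips to about twice that.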

What is Stargate, concretely?

Stargate is OpenAI's umbrella for all future infrastructure (not a single facility). Sites announced through Q1 2026:

Site | Status | GW
Abilene, TX (flagship) | Live (GB200 racks since Jun 2025) | up to ~1
Shackelford County, TX | Construction | 1.4
Doña Ana County, NM | Construction |
Lordstown, OH (SoftBank) | Construction |
Milam County, TX (SB Energy) | 2026 | 1.2
Stargate Michigan ("The Barn") | 2026 construction | 1.0
Stargate UAE | 2026 opening |
Stargate Norway / UK / Argentina | Planned |
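As a quick tally, the sketch below sums only the capacities the table actually states and compares them with the 10 GW commitment. It treats Abilene's "up to ~1" as 1 GW (an upper bound) and leaves sites without a stated figure as unknown rather than guessing; the dictionary and function names are illustrative.

```python
from typing import Optional

# Capacity in MW as stated in the table; None means no figure is given.
# Assumption: Abilene's "up to ~1 GW" is counted at its 1,000 MW upper bound.
SITES_MW: dict[str, Optional[int]] = {
    "Abilene, TX": 1_000,
    "Shackelford County, TX": 1_400,
    "Doña Ana County, NM": None,
    "Lordstown, OH": None,
    "Milam County, TX": 1_200,
    "Stargate Michigan": 1_000,
    "Stargate UAE": None,
    "Norway / UK / Argentina": None,
}
TARGET_MW = 10_000  # the 10 GW commitment


def stated_total_mw(sites: dict[str, Optional[int]]) -> int:
    """Sum of capacities for sites that have a stated figure."""
    return sum(mw for mw in sites.values() if mw is not None)


def unstated_sites(sites: dict[str, Optional[int]]) -> list[str]:
    """Sites listed without a capacity figure."""
    return [name for name, mw in sites.items() if mw is None]


if __name__ == "__main__":
    total = stated_total_mw(SITES_MW)
    print(f"Stated capacity: {total / 1_000:.1f} GW "
          f"({total / TARGET_MW:.0%} of the 10 GW target)")
    print("No figure given for:", ", ".join(unstated_sites(SITES_MW)))
```

The stated rows cover less than half of the 10 GW target, so the ~7 GW under contract cited in the TL;DR must include the sites listed here without a figure.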

Plus contracts: $300B with Oracle (5 years), $100B NVIDIA equity-for-compute, $90B AMD warrant deal, $250B Microsoft Azure, $350B Broadcom custom silicon, $50B AWS Trainium.

FAQ

Q: Are these single-fabric clusters or "logical" clusters? Colossus 1 is split across four 25,000-GPU halls; analysts believe they were not designed as one HPL system. Rainier is multi-data-center with EFA cross-site. Stargate Abilene is single-site.

Q: Power per GPU? H100 ≈ 700 W, GB200 up to 1,200 W. A 100K-GB200 system draws ~120 MW for the chips alone (100,000 × 1,200 W), roughly doubling once cooling and networking are included.
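The power estimate above is simple enough to reproduce. The sketch below is a back-of-the-envelope calculation only: the overhead factor of 2.0 encodes the "doubling" rule of thumb from the answer above, not a measured PUE, and the function names are illustrative.

```python
# Per-chip power figures quoted above, in watts.
H100_WATTS = 700
GB200_WATTS = 1_200  # upper bound ("up to 1,200 W")

# Assumption: cooling + networking roughly double the chip power,
# per the rule of thumb in the text. This is not a measured PUE.
OVERHEAD_FACTOR = 2.0


def chip_power_mw(num_chips: int, watts_per_chip: int) -> float:
    """Power drawn by the accelerators alone, in megawatts."""
    return num_chips * watts_per_chip / 1_000_000


def facility_power_mw(num_chips: int, watts_per_chip: int,
                      overhead: float = OVERHEAD_FACTOR) -> float:
    """Chip power scaled by an overhead factor for cooling and networking."""
    return chip_power_mw(num_chips, watts_per_chip) * overhead


if __name__ == "__main__":
    print(f"100K GB200, chips only: {chip_power_mw(100_000, GB200_WATTS):.0f} MW")
    print(f"100K GB200, with overhead: {facility_power_mw(100_000, GB200_WATTS):.0f} MW")
    print(f"100K H100, chips only: {chip_power_mw(100_000, H100_WATTS):.0f} MW")
```

Scaling the same arithmetic to 1M GB200-class chips gives about 1.2 GW for the chips alone.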

Q: 1M GPUs — feasible? xAI's Colossus 2 roadmap, OpenAI's Stargate 10 GW, and Anthropic's "≥5 GW" AWS expansion all imply ~$725B aggregate 2026 hyperscaler capex.

References

  • xAI Colossus: x.ai/colossus; SemiAnalysis "Colossus 2"; en.wikipedia.org/wiki/Colossus_(supercomputer)
  • AWS Rainier: aboutamazon.com/news/aws/aws-project-rainier-ai-trainium-chips-compute-cluster; anthropic.com/news/anthropic-amazon-compute
  • Stargate: openai.com/index/announcing-the-stargate-project/; openai.com/index/five-new-stargate-sites/

Further reading

→ Article 12 · Article 16 (reliability) · Article 24 (energy & capex)

Series

LLM Pre-Training 2026