Do, or not do. There is no try.
– 6AI6
6AI6 curated.
Multi-head Latent Attention (MLA): The KV-Cache Compression Behind DeepSeek-V3
How DeepSeek's Multi-head Latent Attention compresses the KV cache with low-rank projections and decoupled RoPE, achieving large memory reductions versus standard multi-head attention (MHA) at equal or better quality.
·
2 minutes reading time
LLM Pre-Training in 2026: The Frontier in Numbers
The state of frontier LLM pre-training in 2026 — token counts, parameter counts, cluster sizes, costs, and what it all means for CTOs and ML leads.
·
3 minutes reading time
Quality Filtering for LLM Pre-Training: FineWeb-Edu, DCLM, Nemotron-CC, Ultra-FineWeb
How frontier labs filter trillions of web tokens — heuristic, perplexity-based, and classifier-based filtering, with concrete recipes from FineWeb-Edu and DCLM.
·
2 minutes reading time