Quality Filtering for LLM Pre-Training: FineWeb-Edu, DCLM, Nemotron-CC, Ultra-FineWeb
TL;DR
- Classifier-based filtering is the single biggest open-data lever: FineWeb-Edu yields ~10× the MMLU lift of unfiltered FineWeb.
- Two recipes dominate in 2026: (1) train a small classifier on LLM-rated documents (FineWeb-Edu); (2) use fastText classifiers + perplexity (DCLM, Ultra-FineWeb).
- Nemotron-CC pushes the frontier further: it has 4× more unique real tokens than DCLM, and an 8B model trained on it for 15T tokens beats Llama 3.1 8B by +5 MMLU and +3.1 ARC-Challenge.
Key facts
| Pipeline | Token output | Filter type | Reference |
|---|---|---|---|
| FineWeb-Edu | 1.3T (from 15T FineWeb) | Llama-3-70B-rated → linear classifier | Penedo et al., arXiv:2406.17557 |
| DCLM-baseline | 3.8T | fastText + perplexity + heuristic | Li et al., 2024 |
| Nemotron-CC | 6.3T unique | classifier ensemble + synthetic rephrase | Su et al., arXiv:2412.02595 |
| Ultra-FineWeb-en | ~1T | fastText (dim 256, n-gram 3) | Wang et al., 2025 |
How does classifier-based filtering work?
- Sample ~500K documents from your corpus.
- Score each with a strong LLM (FineWeb-Edu uses Llama-3-70B-Instruct, rating each document 0–5 on "educational value").
- Train a small classifier to predict those scores (FineWeb-Edu: a linear head on Snowflake-arctic-embed-m embeddings).
- Apply the classifier to the full corpus and keep documents above a score threshold. FineWeb-Edu's reported cost: 6,000 H100-hours for 15T tokens.
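The steps above can be sketched in a few lines. This is a minimal illustration, not FineWeb-Edu's actual code. Random vectors stand in for document embeddings, the "LLM ratings" are simulated from a hidden linear signal, and the linear head is fit with closed-form ridge regression. The names `fit_linear_head`, `predict`, and `THRESHOLD` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 2,000 "documents" with 32-dim embeddings
# and integer ratings 0-5 simulated from a hidden linear signal.
n_docs, dim = 2000, 32
embeddings = rng.normal(size=(n_docs, dim))
hidden_w = rng.normal(size=dim)
raw = embeddings @ hidden_w
# Map the hidden signal onto a 0-5 scale, mimicking an LLM annotator.
llm_scores = np.clip(np.round(2.5 + raw / raw.std()), 0, 5)

def fit_linear_head(X, y, l2=1.0):
    """Closed-form ridge regression: a linear head with a bias term."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    reg = l2 * np.eye(Xb.shape[1])
    reg[-1, -1] = 0.0  # leave the bias unregularised
    return np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)

def predict(X, w):
    """Score documents with the fitted linear head."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

# Train on the annotated sample, then score held-out documents.
train, test = slice(0, 1500), slice(1500, None)
w = fit_linear_head(embeddings[train], llm_scores[train])
pred = predict(embeddings[test], w)

THRESHOLD = 3  # keep documents whose rounded predicted score is >= 3
keep_mask = np.round(pred) >= THRESHOLD
corr = np.corrcoef(pred, llm_scores[test])[0, 1]
print(f"held-out correlation: {corr:.2f}, kept: {keep_mask.mean():.1%}")
```

In a real pipeline the embeddings would come from an embedding model and the labels from the LLM annotator. The expensive part is the final pass, which scores every document in the corpus.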
Why does it work?
At a threshold of 3, FineWeb-Edu keeps only ~8.7% of tokens (1.3T of 15T), yet it lifts MMLU by ~3 points and ARC by ~5 points relative to full FineWeb.
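As a quick sanity check, the retention rate and the per-GPU scoring throughput follow directly from the figures quoted in this section. The constant names are illustrative, and the throughput assumes the 6,000 H100-hours cover the whole 15T-token pass.

```python
# Figures quoted in this section.
FINEWEB_TOKENS = 15e12       # full FineWeb
FINEWEB_EDU_TOKENS = 1.3e12  # kept after filtering at threshold 3
H100_HOURS = 6_000           # classifier inference over the full corpus

retention = FINEWEB_EDU_TOKENS / FINEWEB_TOKENS
tokens_per_gpu_hour = FINEWEB_TOKENS / H100_HOURS

print(f"retention: {retention:.1%}")                        # retention: 8.7%
print(f"tokens per H100-hour: {tokens_per_gpu_hour:.2e}")   # 2.50e+09
```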