Quality Filtering for LLM Pre-Training: FineWeb-Edu, DCLM, Nemotron-CC, Ultra-FineWeb

May 25, 2026

Quality Filtering: FineWeb-Edu, DCLM, Ultra-FineWeb

Classifier-based quality filtering is the single biggest lever available for open pre-training data: the FineWeb authors report that FineWeb-Edu matches the MMLU score of C4 and Dolma with roughly 10× fewer training tokens.
Two recipes dominate in 2026: (1) train a small classifier on LLM-rated documents (FineWeb-Edu); (2) train a fastText classifier on curated positive and negative examples (DCLM, Ultra-FineWeb).
Nemotron-CC pushes further on scale: it contains about 4× more unique real tokens than DCLM, and an 8B model trained for 15T tokens with Nemotron-CC data scores +5 on MMLU and +3.1 on ARC-Challenge over Llama 3.1 8B.

Pipeline	Token output	Filter type	Reference
FineWeb-Edu	1.3T from FineWeb 15T	Llama-3-70B-rated → linear classifier	Penedo et al., arXiv:2406.17557
DCLM-baseline	3.8T	fastText classifier + heuristic filters + deduplication	Li et al., 2024
Nemotron-CC	6.3T unique	classifier ensemble + synthetic rephrase	Su et al., arXiv:2412.02595
Ultra-FineWeb-en	~1T	fastText (dim 256, n-gram 3)	Wang et al., 2025
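
fastText classifiers like those in the table are trained from a plain-text file with one labeled example per line. The following is a minimal sketch of preparing such a file, not the pipeline any of these projects actually ran. The label names `hq` and `lq`, the helper names, and the file name `train.txt` are hypothetical, and mapping the table's "dim 256, n-gram 3" onto fastText's `dim=256, wordNgrams=3` is an assumption rather than a confirmed Ultra-FineWeb configuration.

```python
def to_fasttext_line(text: str, is_high_quality: bool) -> str:
    """Format one document as a fastText supervised-training line."""
    # Hypothetical label names; fastText only requires the "__label__" prefix.
    label = "__label__hq" if is_high_quality else "__label__lq"
    # fastText reads one example per line, so collapse newlines and extra spaces.
    return f"{label} {' '.join(text.split())}"


def write_training_file(examples, path="train.txt"):
    """Write (text, is_high_quality) pairs to a fastText training file."""
    with open(path, "w", encoding="utf-8") as f:
        for text, is_high_quality in examples:
            f.write(to_fasttext_line(text, is_high_quality) + "\n")


# With the `fasttext` package installed, training would then look roughly like:
#   import fasttext
#   model = fasttext.train_supervised(input="train.txt", dim=256, wordNgrams=3)
#   labels, probs = model.predict("some document text")

print(to_fasttext_line("Photosynthesis converts light\ninto chemical energy.", True))
# __label__hq Photosynthesis converts light into chemical energy.
```

The choice of positive and negative examples matters far more than the formatting step shown here; that selection is what differs between the pipelines in the table.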

The FineWeb-Edu recipe, step by step:
1. Sample ~500K documents from your corpus.
2. Score each with a strong LLM (FineWeb-Edu uses Llama-3-70B-Instruct, rating 0–5 for "educational value").
3. Train a small classifier on those scores (FineWeb-Edu: a linear head on Snowflake-arctic-embed-m embeddings).
4. Apply the classifier to your full corpus. FineWeb-Edu reports a cost of 6,000 H100-hours for 15T tokens.
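
The classifier-training and cost steps can be sketched in a few lines. This is an illustration under stated assumptions, not the FineWeb-Edu training code: the embeddings are random 4-dimensional toy vectors standing in for real Snowflake-arctic-embed-m outputs, the "LLM ratings" are synthetic, the linear head is fit with plain gradient descent, and `estimate_h100_hours` assumes cost scales linearly with token count. All function names are hypothetical.

```python
import random


def train_linear_head(embeddings, scores, lr=0.3, epochs=500):
    """Fit score ~ w.x + b by full-batch gradient descent on mean squared error."""
    dim, n = len(embeddings[0]), len(embeddings)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, scores):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_score(w, b, embedding):
    """Linear head: dot product of weights and embedding, plus bias."""
    return sum(wi * xi for wi, xi in zip(w, embedding)) + b


def estimate_h100_hours(corpus_tokens, ref_hours=6_000, ref_tokens=15e12):
    """Scale the reported 6,000 H100-hours per 15T tokens linearly (an assumption)."""
    return corpus_tokens * ref_hours / ref_tokens


# Toy stand-ins: random "embeddings" and synthetic ratings that stay within 0-5.
random.seed(0)
true_w, true_b = [1.0, -0.5, 0.3, 0.2], 2.5
embeddings = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(200)]
scores = [predict_score(true_w, true_b, x) for x in embeddings]

w, b = train_linear_head(embeddings, scores)
print([round(v, 3) for v in w], round(b, 3))  # should recover true_w and true_b
print(estimate_h100_hours(1.5e12))            # rough cost for a 1.5T-token corpus
```

In practice the expensive part is computing embeddings for every document in the corpus, which is why the reported cost is dominated by inference rather than by training the head.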

At a score threshold of 3 (on the 0–5 scale), FineWeb-Edu keeps ~8.7% of tokens (1.3T of 15T) but lifts MMLU by ~3 points and ARC by ~5 points over unfiltered FineWeb.
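
Applying the threshold and measuring retention can be sketched as follows. The documents and scores are made up for illustration, the function names are hypothetical, and clamping then rounding the raw classifier output before comparing against the threshold is an assumption about how an integer cutoff is applied to a regression score.

```python
def int_score(raw_score: float) -> int:
    """Clamp a regression output to [0, 5] and round to an integer score."""
    return int(round(max(0.0, min(raw_score, 5.0))))


def filter_by_threshold(docs, threshold=3):
    """docs: iterable of (text, raw_score, n_tokens).

    Returns (kept_docs, fraction_of_tokens_kept).
    """
    docs = list(docs)
    kept = [d for d in docs if int_score(d[1]) >= threshold]
    total_tokens = sum(d[2] for d in docs)
    kept_tokens = sum(d[2] for d in kept)
    return kept, (kept_tokens / total_tokens if total_tokens else 0.0)


# Made-up documents: (text, raw classifier score, token count).
docs = [
    ("intro to linear algebra", 3.4, 100),
    ("lecture notes on cells", 2.6, 200),
    ("product listing", 2.4, 300),
    ("cookie banner", 0.7, 350),
    ("physics tutorial", 4.9, 50),
]
kept, fraction = filter_by_threshold(docs)
print(len(kept), fraction)       # 3 0.35
print(round(1.3e12 / 15e12, 3))  # 0.087, the ~8.7% retention quoted above
```

Tracking the kept-token fraction alongside benchmark scores makes it easier to compare thresholds, since a stricter cutoff trades corpus size for quality.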