Quality Filtering for LLM Pre-Training: FineWeb-Edu, DCLM, Nemotron-CC, Ultra-FineWeb

TL;DR

  • Classifier-based filtering is the single biggest lever in open pre-training data: FineWeb-Edu reportedly matches the MMLU of earlier open web corpora such as C4 and Dolma with roughly 10× fewer training tokens.
  • Two recipes dominate in 2026: (1) train a small classifier on LLM-rated documents (FineWeb-Edu); (2) use fastText classifiers + perplexity (DCLM, Ultra-FineWeb).
  • Nemotron-CC pushes the frontier further: it contains 4× more unique real tokens than DCLM, and an 8B model trained on it for 15T tokens beats Llama 3.1 8B by +5 MMLU and +3.1 ARC-Challenge.

Key facts

Pipeline         | Token output            | Filter type                              | Reference
FineWeb-Edu      | 1.3T (from 15T FineWeb) | Llama-3-70B-rated → linear classifier    | Penedo et al., arXiv:2406.17557
DCLM-baseline    | 3.8T                    | fastText + perplexity + heuristic        | Li et al., 2024
Nemotron-CC      | 6.3T unique             | classifier ensemble + synthetic rephrase | Su et al., arXiv:2412.02595
Ultra-FineWeb-en | ~1T                     | fastText (dim 256, n-gram 3)             | Wang et al., 2025
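
The fastText settings in the last row can be sketched as follows. This is an illustrative sketch, not the Ultra-FineWeb authors' code: it assumes "dim 256, n-gram 3" maps to the `dim` and `wordNgrams` arguments of the `fasttext` Python package, and the helper names and the `hq` label are hypothetical.

```python
from typing import Iterable, Tuple


def to_fasttext_line(text: str, label: str) -> str:
    """Format one example in fastText's supervised format: '__label__<label> <text>'."""
    # fastText expects one example per line, so collapse newlines and repeated spaces.
    flat = " ".join(text.split())
    return f"__label__{label} {flat}"


def write_training_file(examples: Iterable[Tuple[str, str]], path: str) -> None:
    """Write (text, label) pairs to a fastText training file."""
    with open(path, "w", encoding="utf-8") as handle:
        for text, label in examples:
            handle.write(to_fasttext_line(text, label) + "\n")


def train_quality_classifier(train_path: str):
    """Train a fastText classifier; requires the third-party `fasttext` package."""
    import fasttext  # imported lazily so the helpers above work without it

    # Assumption: "dim 256, n-gram 3" corresponds to dim=256 and wordNgrams=3.
    return fasttext.train_supervised(input=train_path, dim=256, wordNgrams=3)


# "hq" is a hypothetical label for high-quality documents.
example_line = to_fasttext_line(
    "Photosynthesis converts\nlight into chemical energy.", "hq"
)
print(example_line)
```

A model trained this way returns labels and probabilities from `model.predict(text)`; the probability of the high-quality label is what a filter would threshold.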

How does classifier-based filtering work?

  1. Take ~500K documents from your corpus.
  2. Score each with a strong LLM (FineWeb-Edu uses Llama-3-70B-Instruct rating 0–5 on "educational value").
  3. Train a small classifier (FineWeb-Edu: linear head on Snowflake-arctic-embed-m embeddings).
  4. Apply the classifier to the full corpus. FineWeb-Edu reports a cost of about 6,000 H100-hours to score 15T tokens.
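
Steps 3 and 4 can be sketched in a few lines, assuming the embeddings and LLM ratings from steps 1 and 2 already exist. Everything here is a toy stand-in: the 2-D vectors replace real embeddings, `fit_linear_head` and `predict` are hypothetical names, and plain gradient descent is used for illustration rather than whatever training setup the FineWeb-Edu authors used.

```python
from typing import Dict, List, Sequence, Tuple


def predict(weights: Sequence[float], bias: float, emb: Sequence[float]) -> float:
    """Linear head: dot product of weights and embedding, plus bias."""
    return sum(w * x for w, x in zip(weights, emb)) + bias


def fit_linear_head(
    embs: List[List[float]],
    ratings: List[float],
    lr: float = 0.05,
    epochs: int = 2000,
) -> Tuple[List[float], float]:
    """Fit a linear regression head with full-batch gradient descent on squared error."""
    dim = len(embs[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(embs)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for emb, rating in zip(embs, ratings):
            err = predict(weights, bias, emb) - rating
            for j in range(dim):
                grad_w[j] += err * emb[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy "embeddings" with LLM ratings on the 0-5 scale (stand-ins for steps 1-2).
train_embs = [[0.0, 1.0], [0.2, 0.8], [0.8, 0.2], [1.0, 0.0]]
train_ratings = [0.0, 1.0, 4.0, 5.0]
weights, bias = fit_linear_head(train_embs, train_ratings)

# Step 4: score unseen documents and keep those at or above the threshold.
corpus_embs: Dict[str, List[float]] = {"doc_a": [0.9, 0.1], "doc_b": [0.1, 0.9]}
scores = {doc_id: predict(weights, bias, emb) for doc_id, emb in corpus_embs.items()}
kept = [doc_id for doc_id, score in scores.items() if score >= 3.0]
print(kept)
```

The point of the design is cost: the expensive LLM is called only on the sample, while the cheap linear head runs over the whole corpus.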

Why does it work?

With a score threshold of 3, FineWeb-Edu keeps only ~8.7% of tokens (1.3T of 15T), yet it lifts MMLU by ~3 points and ARC by ~5 points relative to training on the full FineWeb.
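
As a quick check of the arithmetic, the retained fraction and the classifier's scoring throughput follow directly from the figures quoted in this article. The variable names are illustrative only.

```python
# Figures quoted in this article.
FINEWEB_TOKENS = 15e12        # full FineWeb
FINEWEB_EDU_TOKENS = 1.3e12   # tokens kept at threshold 3
CLASSIFIER_H100_HOURS = 6000  # reported cost of scoring the full corpus

kept_fraction = FINEWEB_EDU_TOKENS / FINEWEB_TOKENS
tokens_per_gpu_hour = FINEWEB_TOKENS / CLASSIFIER_H100_HOURS

print(f"kept: {kept_fraction:.1%}")
print(f"scored per H100-hour: {tokens_per_gpu_hour:.2e} tokens")
```

In other words, the filter discards more than nine tokens in ten, and scoring runs at about 2.5 billion tokens per H100-hour.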

Code (FineWeb-Edu classifier inference, simplified)
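
A minimal sketch of inference, assuming the classifier is published on the Hugging Face Hub as `HuggingFaceFW/fineweb-edu-classifier` and loads through the `transformers` sequence-classification API (both are assumptions, not stated above). The helper names are hypothetical. The model download is kept inside `score_texts` so the thresholding helpers run without network access.

```python
from typing import List


def to_int_score(raw_score: float) -> int:
    """Clamp the raw regression output to [0, 5] and round to the nearest integer."""
    return int(round(max(0.0, min(raw_score, 5.0))))


def keep_document(raw_score: float, threshold: int = 3) -> bool:
    """Keep a document if its rounded score meets the threshold (3 in FineWeb-Edu)."""
    return to_int_score(raw_score) >= threshold


def score_texts(
    texts: List[str],
    model_name: str = "HuggingFaceFW/fineweb-edu-classifier",  # assumed checkpoint name
) -> List[float]:
    """Return one raw educational-value score per text (needs torch and transformers)."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(texts, return_tensors="pt", padding="longest", truncation=True)
    with torch.no_grad():
        # The head is a regressor: one logit per document, read as a 0-5 score.
        logits = model(**inputs).logits.squeeze(-1).float()
    return logits.tolist()


# Thresholding on example raw scores (no model download needed).
print([keep_document(score) for score in (0.4, 2.6, 4.1)])
```

Calling `score_texts(["some document text"])` and passing each result to `keep_document` would reproduce the filter on real text; at corpus scale this would be batched across GPUs.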
