1 post with tag nemotron-cc

Quality Filtering for LLM Pre-Training: FineWeb-Edu, DCLM, Nemotron-CC, Ultra-FineWeb

How frontier labs filter trillions of web tokens: heuristic, perplexity-based, and classifier-based filtering, with concrete recipes from FineWeb-Edu and DCLM.

2 minutes reading time