OptimalScale/ClimbLab
Source: Hugging Face | Updated: 2025-05-04 | Indexed: 2025-07-05
Download link:
https://hf-mirror.com/datasets/OptimalScale/ClimbLab
Description:
ClimbLab is a high-quality pre-training corpus released by NVIDIA. It is built from Nemotron-CC and SmolLM-Corpus, which were semantically reorganized and filtered with the CLIMB-clustering algorithm into 20 distinct clusters, yielding a corpus of 1.2 trillion tokens. Two classifiers, one detecting advertisements and one assessing the educational value of the text, were used to remove low-quality data. The dataset is published as raw text converted back from gpt-2 tokens.
Provided by:
OptimalScale
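
Because the corpus is described as 1.2 trillion tokens of raw text, downloading it in full just to inspect it is rarely practical. The following is a minimal sketch of peeking at a few records in streaming mode, assuming the Hugging Face `datasets` library is installed and that the repository exposes a `train` split (an assumption, not confirmed by this listing). The helper name `peek_records` and its `loader` parameter are hypothetical and exist only for illustration; no column names are assumed.

```python
from itertools import islice
from typing import Any, Callable, Optional

# Repository ID taken from the listing title above.
DATASET_ID = "OptimalScale/ClimbLab"


def peek_records(
    n: int = 3,
    split: str = "train",  # assumption: the split name is not stated in the listing
    loader: Optional[Callable[..., Any]] = None,
) -> list:
    """Return the first n records of the dataset without a full download.

    `loader` is a hypothetical injection point so the function can be
    exercised offline; by default it falls back to datasets.load_dataset.
    """
    if loader is None:
        # Imported lazily so this module loads even without `datasets` installed.
        from datasets import load_dataset

        loader = load_dataset

    # streaming=True yields records lazily instead of materializing the corpus.
    stream = loader(DATASET_ID, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `peek_records(2)` with network access should return two records as dictionaries; printing the keys of the first record is a quick way to discover the actual column names before writing any processing code.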



