OptimalScale/ClimbMix
Source: Hugging Face. Updated: 2025-05-04. Indexed: 2025-07-05.
Download link:
https://hf-mirror.com/datasets/OptimalScale/ClimbMix
Description:
ClimbMix is a high-quality pre-training corpus released by NVIDIA, containing 400 billion tokens. It is designed for efficient pre-training and delivers superior performance under an equal token budget. The dataset was built with a new algorithm that filters and mixes the data in three steps. First, the data is grouped by topic. Second, two classifiers score each text, one detecting advertising content and the other assessing educational value, and low-scoring data is removed. Third, the remaining high-quality groups are mixed with specific weights to produce the final dataset. The dataset was released as GPT-2 tokens, which are not easy to use directly, so the GPT-2 tokenizer was used to convert them back into raw text.
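The group, filter, and mix steps described above can be illustrated with a minimal Python sketch. This is not NVIDIA's actual implementation. The function name `filter_and_mix`, the record fields (`topic`, `ad_score`, `edu_score`), the thresholds, the score scale, and the weighted sampling with replacement are all assumptions made for illustration, because the description does not specify them.

```python
import random
from collections import defaultdict


def filter_and_mix(docs, weights, max_ad_score, min_edu_score, n_samples, seed=0):
    """Hypothetical sketch: group by topic, drop low-quality docs, mix by weight.

    docs: list of dicts with assumed keys 'text', 'topic', 'ad_score', 'edu_score'.
    weights: dict mapping topic -> non-negative mixing weight (assumed form).
    """
    # Step 1: group documents by topic.
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["topic"]].append(doc)

    # Step 2: keep documents that look non-promotional and educational.
    # The two scores stand in for the outputs of the two classifiers.
    kept = {
        topic: [
            d for d in members
            if d["ad_score"] <= max_ad_score and d["edu_score"] >= min_edu_score
        ]
        for topic, members in groups.items()
    }

    # Step 3: mix the surviving groups according to their weights.
    # Sampling with replacement is an assumption made to keep the sketch short.
    topics = [t for t in kept if kept[t] and weights.get(t, 0) > 0]
    if not topics:
        return []
    rng = random.Random(seed)
    topic_weights = [weights[t] for t in topics]
    picks = rng.choices(topics, weights=topic_weights, k=n_samples)
    return [rng.choice(kept[topic]) for topic in picks]


if __name__ == "__main__":
    sample_docs = [
        {"text": "a", "topic": "science", "ad_score": 0.1, "edu_score": 0.9},
        {"text": "b", "topic": "science", "ad_score": 0.8, "edu_score": 0.9},
        {"text": "c", "topic": "shopping", "ad_score": 0.1, "edu_score": 0.2},
        {"text": "d", "topic": "history", "ad_score": 0.2, "edu_score": 0.7},
    ]
    sample_weights = {"science": 3.0, "history": 1.0, "shopping": 2.0}
    for doc in filter_and_mix(sample_docs, sample_weights, 0.5, 0.5, n_samples=5):
        print(doc["topic"], doc["text"])
```

In this sketch a topic whose documents are all filtered out contributes nothing to the mix, regardless of its weight. A real pipeline would more likely sample without replacement up to a token budget per group.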
Provider:
OptimalScale



