Nemotron-ClimbMix
ModelScope community · updated 2025-12-04 · indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/nv-community/Nemotron-ClimbMix
<div align="center">
<span style="font-family: default; font-size: 1.5em;">ClimbMix Dataset</span>
<div>
🚀 Creating the highest-quality pre-training datasets for LLMs 🌟
</div>
</div>
<div style="display: flex; gap: 10px; margin-top: 15px; justify-content: center;">
<a href="https://arxiv.org/abs/2504.13161" style="display: inline-block; background-color: #0d1117; color: white; text-decoration: none; padding: 10px 20px; border-radius: 4px;">
📄 PAPER
</a>
<a href="https://huggingface.co/datasets/nvidia/ClimbLab" style="display: inline-block; background-color: #0d1117; color: white; text-decoration: none; padding: 10px 20px; border-radius: 4px;">
🤗 CLIMBLAB
</a>
<a href="https://huggingface.co/datasets/nvidia/ClimbMix" style="display: inline-block; background-color: #0d1117; color: white; text-decoration: none; padding: 10px 20px; border-radius: 4px;">
🤗 CLIMBMIX
</a>
<a href="https://research.nvidia.com/labs/lpr/climb/" style="display: inline-block; background-color: #0d1117; color: white; text-decoration: none; padding: 10px 20px; border-radius: 4px;">
🏠 HOMEPAGE
</a>
</div>
<table>
<tr>
<td align="center">
<img src="assets/cont_pretrain.png" width="300"/><br/>
<sub><b>Figure 1:</b> Continued training of a 1B model on ClimbMix yields a 2.0% improvement over Llama-3.2-1B, a more efficient scaling trend than prior models show. </sub>
</td>
<td align="center">
<img src="assets/pretrain_from_scratch.png" width="360"/><br/>
<sub><b>Figure 2:</b> Pre-training a 1B model from scratch on ClimbMix scales better than training on other datasets. </sub>
</td>
</tr>
</table>
## Dataset Description
ClimbMix is a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. It was introduced in [this paper](https://huggingface.co/papers/2504.13161).
We propose a new algorithm to filter and mix the data. First, we cluster the data into 1,000 groups based on topic information. Next, we apply two classifiers: one detects advertisements and the other assesses the educational value of the text. Each group is scored with these classifiers, and low-scoring groups are removed. Finally, the remaining high-quality groups are mixed with per-group weights to produce the final dataset.
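The filter-then-mix steps above can be sketched in a few lines of Python. This is a toy illustration only: the `Group` fields, the score scales, the thresholds, and the mixture weights are all hypothetical, and the sketch does not reproduce the classifiers or the weight search described in the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class Group:
    """One topic group with classifier scores aggregated over its documents."""
    name: str
    ad_score: float   # assumed scale 0..1, higher = more advertisement-like
    edu_score: float  # assumed scale 0..1, higher = more educational
    docs: list[str]


def filter_groups(groups, max_ad=0.5, min_edu=0.5):
    """Keep groups that are not ad-heavy and have enough educational value.

    The thresholds are illustrative, not values from the paper.
    """
    return [g for g in groups if g.ad_score <= max_ad and g.edu_score >= min_edu]


def mix(groups, weights, n_samples, seed=0):
    """Draw documents: pick a group by its weight, then a document uniformly."""
    rng = random.Random(seed)
    names = [g.name for g in groups]
    by_name = {g.name: g for g in groups}
    picked = rng.choices(names, weights=[weights[n] for n in names], k=n_samples)
    return [rng.choice(by_name[n].docs) for n in picked]


groups = [
    Group("math", 0.05, 0.9, ["proof sketch", "algebra notes"]),
    Group("coupons", 0.95, 0.1, ["buy now", "50% off"]),
    Group("biology", 0.10, 0.7, ["cell structure"]),
]
kept = filter_groups(groups)
sample = mix(kept, {"math": 0.7, "biology": 0.3}, n_samples=5)
print([g.name for g in kept])  # ['math', 'biology']
print(sample)
```

Sampling by group weight first and by document second keeps the mixture proportions independent of group size, which is the property a weighted mix needs.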
This dataset is for research and development only.
## Dataset Details
* **Owner(s):** NVIDIA
* **Creation Date:** Feb. 1, 2025
* **License/Terms of Use:** CC BY-NC 4.0
* **Intended Usage:** Pre-training language models.
* **Format:** Text in parquet format
* **Size:** 400 billion tokens
* **Data Collection Method:** Automated
* **Labeling Method:** Automated
## Usage
The released ClimbMix dataset contains token sequences produced by the GPT-2 tokenizer. To recover the raw text, use the provided script `detokenize_climbmix.py`. For example:
```bash
python detokenize_climbmix.py --input_folder <tokenized_folder> --output_folder <raw_text_folder>
```
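If you would rather decode inside your own pipeline, the idea can be sketched as below. This is not the provided `detokenize_climbmix.py`; it assumes you have already loaded each row's token ids as a list of integers (the Parquet column name is not specified here), and it uses the third-party `tiktoken` package as one possible GPT-2 decoder. The demo at the bottom runs with a toy vocabulary so that it needs no downloads.

```python
from typing import Callable, Iterable, Iterator, Sequence


def gpt2_decoder() -> Callable[[Sequence[int]], str]:
    """Return a GPT-2 decode function backed by tiktoken (imported lazily)."""
    import tiktoken  # third-party: pip install tiktoken

    enc = tiktoken.get_encoding("gpt2")
    return lambda ids: enc.decode(list(ids))


def detokenize(rows: Iterable[Sequence[int]],
               decode: Callable[[Sequence[int]], str]) -> Iterator[str]:
    """Decode each row of token ids back into a text string."""
    for ids in rows:
        yield decode(ids)


# Toy decoder standing in for the real tokenizer, for demonstration only.
toy_vocab = {0: "Hello", 1: ",", 2: " world"}


def toy_decode(ids: Sequence[int]) -> str:
    return "".join(toy_vocab[i] for i in ids)


texts = list(detokenize([[0, 1, 2], [0]], toy_decode))
print(texts)  # ['Hello, world', 'Hello']
```

With real data, pass `gpt2_decoder()` in place of `toy_decode`.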
Community members have also converted ClimbMix to raw text and released it on Hugging Face: https://huggingface.co/datasets/OptimalScale/ClimbMix. Using this version saves the manual conversion, but note that it is not an official release, and we are not responsible for the content or maintenance of community-hosted datasets.
## Training
To help reproduce the results, we provide the training script for ClimbMix in `nanoGPT/train.sh`. The code is based on the [nanoGPT](https://github.com/karpathy/nanoGPT) project, and we made no changes to the model definition or the training process. The only changes are:
1. Preprocessed and tokenized the ClimbMix dataset in `nanoGPT/data/climbmix/prepare.sh`.
2. Modified the training configuration in `nanoGPT/config/train_gpt2_climbmix.py`.
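For orientation, stock nanoGPT reads its training data as a flat binary file of unsigned 16-bit token ids, which is enough for the GPT-2 vocabulary of 50,257 entries. The sketch below shows that packing step using only the standard library. It is an assumption about what a preparation step of this kind does, not the contents of `prepare.sh`, and `pack_tokens` and `read_tokens` are hypothetical helper names.

```python
import os
import tempfile
from array import array

GPT2_VOCAB_SIZE = 50257  # every id fits in an unsigned 16-bit integer


def pack_tokens(sequences, path):
    """Write all token ids to one flat uint16 file; return the token count."""
    total = 0
    with open(path, "wb") as f:
        for ids in sequences:
            if any(not 0 <= t < GPT2_VOCAB_SIZE for t in ids):
                raise ValueError("token id outside the GPT-2 vocabulary")
            array("H", ids).tofile(f)  # "H" = unsigned short, native byte order
            total += len(ids)
    return total


def read_tokens(path):
    """Read the flat uint16 file back into a list of token ids."""
    out = array("H")
    with open(path, "rb") as f:
        out.frombytes(f.read())
    return list(out)


with tempfile.TemporaryDirectory() as tmp:
    bin_path = os.path.join(tmp, "train.bin")
    n = pack_tokens([[15496, 11, 995], [50256]], bin_path)
    roundtrip = read_tokens(bin_path)

print(n, roundtrip)  # 4 [15496, 11, 995, 50256]
```

Storing ids as 16-bit values halves the disk footprint compared with 32-bit integers, which matters at the scale of hundreds of billions of tokens.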
Note: in our paper we used the Llama-2 tokenizer and the Llama-2 model architecture, so the absolute results differ. We verified, however, that the scaling trend relative to other public datasets is the same.
The figure below shows the training curves of the `gpt-2-xl` model on ClimbMix and on other datasets, with openwebtext as the validation data. You can reproduce these results with the script above.
<img src="assets/wandb.png" width="500"/>
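When comparing datasets under an equal token budget, it helps to convert the budget into a number of optimizer steps. The sketch below does that arithmetic for a nanoGPT-style setup, where the tokens consumed per iteration are the product of batch size, context length, gradient-accumulation steps, and the number of processes. The numbers are illustrative defaults, not the settings in `nanoGPT/config/train_gpt2_climbmix.py`, and the 10-billion-token budget is a made-up example.

```python
def tokens_per_iter(batch_size, block_size, grad_accum_steps, world_size=1):
    """Tokens consumed by one optimizer step across all processes."""
    return batch_size * block_size * grad_accum_steps * world_size


def iters_for_budget(token_budget, per_iter):
    """Smallest number of optimizer steps that covers the token budget."""
    return -(-token_budget // per_iter)  # ceiling division on integers


per_iter = tokens_per_iter(batch_size=12, block_size=1024, grad_accum_steps=40)
steps = iters_for_budget(10_000_000_000, per_iter)
print(per_iter)  # 491520
print(steps)     # 20346
```

Fixing the step count this way keeps runs on different datasets comparable, since each run sees the same number of tokens.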
## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
## Citation
If you find our dataset helpful, please cite the following [paper](https://arxiv.org/abs/2504.13161):
```bibtex
@article{diao2025climb,
  author        = {Shizhe Diao and Yu Yang and Yonggan Fu and Xin Dong and Dan Su and Markus Kliegl and Zijia Chen and Peter Belcak and Yoshi Suhara and Hongxu Yin and Mostofa Patwary and Celine Lin and Jan Kautz and Pavlo Molchanov},
  title         = {CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training},
  journal       = {arXiv preprint},
  year          = {2025},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2504.13161},
}
```
Provider: maas
Created: 2025-10-24



