ClimbMix
Source: ModelScope community. Updated 2025-12-04; indexed 2025-04-26.
Download link: https://modelscope.cn/datasets/nv-community/ClimbMix
<div align="center">
<span style="font-family: default; font-size: 1.5em;">ClimbMix Dataset</span>
<div>
🚀 Creating the highest-quality pre-training datasets for LLMs 🌟
</div>
</div>
<div style="display: flex; gap: 10px; margin-top: 15px; justify-content: center;">
<a href="https://arxiv.org/abs/2504.13161" style="display: inline-block; background-color: #0d1117; color: white; text-decoration: none; padding: 10px 20px; border-radius: 4px;">
📄 PAPER
</a>
<a href="https://huggingface.co/datasets/nvidia/ClimbLab" style="display: inline-block; background-color: #0d1117; color: white; text-decoration: none; padding: 10px 20px; border-radius: 4px;">
🤗 CLIMBLAB
</a>
<a href="https://huggingface.co/datasets/nvidia/ClimbMix" style="display: inline-block; background-color: #0d1117; color: white; text-decoration: none; padding: 10px 20px; border-radius: 4px;">
🤗 CLIMBMIX
</a>
<a href="https://research.nvidia.com/labs/lpr/climb/" style="display: inline-block; background-color: #0d1117; color: white; text-decoration: none; padding: 10px 20px; border-radius: 4px;">
🏠 HOMEPAGE
</a>
</div>
<table>
<tr>
<td align="center">
<img src="assets/cont_pretrain.png" width="300"/><br/>
<sub><b>Figure 1:</b> Continued pre-training of a 1B model yields a 2.0% improvement over Llama-3.2-1B, showing a more efficient scaling trend than prior models.</sub>
</td>
<td align="center">
<img src="assets/pretrain_from_scratch.png" width="360"/><br/>
<sub><b>Figure 2:</b> Pre-training a 1B model from scratch on ClimbMix scales better than pre-training on other datasets.</sub>
</td>
</tr>
</table>
## Dataset Description
ClimbMix is a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. It was introduced in [this paper](https://huggingface.co/papers/2504.13161).
We propose a new algorithm for filtering and mixing the data. First, we group the data into 1,000 groups based on topic information. We then apply two classifiers: one detects advertisements and the other assesses the educational value of the text. Each group is scored accordingly, and low-scoring data is removed. Finally, the remaining high-quality groups are mixed with per-group weights to produce the final dataset.
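The filter-then-mix steps described above can be illustrated with a minimal Python sketch. It is not the released pipeline: the `Group` structure, the score ranges, the thresholds `max_ad` and `min_edu`, and the mixture weights are all hypothetical stand-ins, since the text does not specify how scores are thresholded or how the weights are chosen.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Group:
    topic_id: int
    docs: List[str]
    ad_score: float   # hypothetical: higher means more advertisement-like
    edu_score: float  # hypothetical: higher means more educational


def filter_groups(groups: List[Group], max_ad: float = 0.5,
                  min_edu: float = 0.5) -> List[Group]:
    """Drop groups that look like ads or have low educational value.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    return [g for g in groups
            if g.ad_score <= max_ad and g.edu_score >= min_edu]


def mix(groups: List[Group], weights: Dict[int, float],
        n_samples: int, seed: int = 0) -> List[str]:
    """Sample documents, picking each document's group by its mixture weight."""
    rng = random.Random(seed)
    group_weights = [weights[g.topic_id] for g in groups]
    picked = rng.choices(groups, weights=group_weights, k=n_samples)
    return [rng.choice(g.docs) for g in picked]
```

For example, `mix(filter_groups(groups), weights, n_samples=1000)` would draw 1,000 documents from the surviving groups, in proportion to whatever weights are supplied.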
This dataset is for research and development only.
## Dataset Details
* **Owner(s):** NVIDIA
* **Creation Date:** Feb. 1, 2025
* **License/Terms of Use:** CC BY-NC 4.0
* **Intended Usage:** Pre-training language models.
* **Format:** Text in parquet format
* **Size:** 400 billion tokens
* **Data Collection Method:** Automated
* **Labeling Method:** Automated
## Usage
The released ClimbMix dataset contains sequences tokenized with the GPT-2 tokenizer. To obtain the raw text, use the provided script `detokenize_climbmix.py`. For example:
```bash
python detokenize_climbmix.py --input_folder <tokenized_folder> --output_folder <raw_text_folder>
```
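Conceptually, detokenization maps each stored sequence of token IDs back to a string. The sketch below shows that idea only; it is not the contents of `detokenize_climbmix.py`. The helper names `detokenize_rows` and `gpt2_decoder` are hypothetical, and the latter assumes the third-party `tiktoken` package is installed.

```python
from typing import Callable, Iterable, Iterator, Sequence


def detokenize_rows(
    rows: Iterable[Sequence[int]],
    decode: Callable[[Sequence[int]], str],
) -> Iterator[str]:
    """Yield one decoded string per row of token IDs."""
    for token_ids in rows:
        yield decode(list(token_ids))


def gpt2_decoder() -> Callable[[Sequence[int]], str]:
    """Return a GPT-2 decode function (assumes `tiktoken` is installed)."""
    import tiktoken  # third-party; imported lazily so the rest runs without it

    encoding = tiktoken.get_encoding("gpt2")
    return lambda ids: encoding.decode(list(ids))
```

With rows of token IDs already loaded from the parquet shards, `list(detokenize_rows(rows, gpt2_decoder()))` would return the corresponding texts. How the rows are stored (column name, dtype) is not stated here, so loading is left out of the sketch.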
We also noticed that some community members have converted and released a raw text version of ClimbMix on Hugging Face: https://huggingface.co/datasets/OptimalScale/ClimbMix. You may consider using this version to save the effort of manual conversion. However, please note that this is not the official release, and we are not responsible for the content or maintenance of community-hosted datasets.
## Training
To help reproduce the results, we provide the training script for ClimbMix in `nanoGPT/train.sh`. The code is based on the [nanoGPT](https://github.com/karpathy/nanoGPT) project, and we made no changes to the model definition or the training process. The only changes are:
1. The ClimbMix dataset is preprocessed and tokenized in `nanoGPT/data/climbmix/prepare.sh`.
2. The training configuration is modified in `nanoGPT/config/train_gpt2_climbmix.py`.
Note: in our paper we used the Llama-2 tokenizer and the Llama-2 model architecture, so the absolute results differ, but we verified that the scaling trend relative to other public datasets is the same.
The training curves below are for the `gpt-2-xl` model on ClimbMix and other datasets, with OpenWebText as the validation data. The script above should let you reproduce these results.
<img src="assets/wandb.png" width="500"/>
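When adapting a nanoGPT-style configuration, it helps to know how many tokens each optimizer step consumes and how many steps a token budget implies. The sketch below does that arithmetic. The three configuration values follow nanoGPT's stock GPT-2 config and are assumptions here, not the contents of `nanoGPT/config/train_gpt2_climbmix.py`.

```python
# Assumed values in the style of a nanoGPT config (not taken from this repo).
batch_size = 12                   # sequences per micro-batch
block_size = 1024                 # tokens per sequence
gradient_accumulation_steps = 40  # micro-batches per optimizer step, summed over all GPUs


def tokens_per_iter(batch_size: int, block_size: int, grad_accum: int) -> int:
    """Tokens consumed by one optimizer step."""
    return batch_size * block_size * grad_accum


def iters_for_budget(token_budget: int, tokens_per_step: int) -> int:
    """Optimizer steps needed to consume the budget (rounded up)."""
    return -(-token_budget // tokens_per_step)


per_step = tokens_per_iter(batch_size, block_size, gradient_accumulation_steps)
full_pass = iters_for_budget(400_000_000_000, per_step)  # one pass over 400B tokens
```

Setting `max_iters` below `full_pass` corresponds to training on a subset of the 400 billion tokens.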
## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
## Citation
If you find our dataset helpful, please cite the following [paper](https://arxiv.org/abs/2504.13161):
```
@article{diao2025climb,
  author        = {Shizhe Diao and Yu Yang and Yonggan Fu and Xin Dong and Dan Su and Markus Kliegl and Zijia Chen and Peter Belcak and Yoshi Suhara and Hongxu Yin and Mostofa Patwary and Celine Lin and Jan Kautz and Pavlo Molchanov},
  title         = {CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training},
  journal       = {arXiv preprint},
  year          = {2025},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2504.13161},
}
```
Provider: maas
Created: 2025-04-21



