
ClimbLab

Source: ModelScope (updated 2025-11-27, indexed 2025-04-26)
Download link: https://modelscope.cn/datasets/AI-ModelScope/ClimbLab
Description:
<div align="center">
<span style="font-family: default; font-size: 1.5em;">ClimbLab Dataset</span>
<div>
🚀 Creating the highest-quality pre-training datasets for LLMs 🌟
</div>
</div>

<div style="display: flex; gap: 10px; margin-top: 15px; justify-content: center;">
  <a href="https://arxiv.org/abs/2504.13161" style="display: inline-block; background-color: #0d1117; color: white; text-decoration: none; padding: 10px 20px; border-radius: 4px;">
    📄 PAPER
  </a>
  <a href="https://huggingface.co/datasets/nvidia/ClimbLab" style="display: inline-block; background-color: #0d1117; color: white; text-decoration: none; padding: 10px 20px; border-radius: 4px;">
    🤗 CLIMBLAB
  </a>
  <a href="https://huggingface.co/datasets/nvidia/ClimbMix" style="display: inline-block; background-color: #0d1117; color: white; text-decoration: none; padding: 10px 20px; border-radius: 4px;">
    🤗 CLIMBMIX
  </a>
  <a href="https://research.nvidia.com/labs/lpr/climb/" style="display: inline-block; background-color: #0d1117; color: white; text-decoration: none; padding: 10px 20px; border-radius: 4px;">
    🏠 HOMEPAGE
  </a>
</div>

<table>
  <tr>
    <td align="center">
      <img src="assets/cont_pretrain.png" width="300"/><br/>
      <sub><b>Figure 1:</b> Continuously training a 1B model yields a 2.0% improvement over Llama-3.2-1B, demonstrating a more efficient scaling trend compared to prior models.</sub>
    </td>
    <td align="center">
      <img src="assets/pretrain_from_scratch.png" width="360"/><br/>
      <sub><b>Figure 2:</b> Pre-training a 1B model from scratch on ClimbMix shows better scaling effects than training on other datasets.</sub>
    </td>
  </tr>
</table>

## Dataset Description:

ClimbLab is a filtered 1.2-trillion-token corpus with 20 clusters. Based on Nemotron-CC and SmolLM-Corpus, we employed our proposed CLIMB-clustering to semantically reorganize and filter this combined dataset into 20 distinct clusters, leading to a 1.2-trillion-token high-quality corpus.

Specifically, we first grouped the data into 1,000 groups based on topic information. Then we applied two classifiers: one to detect advertisements and another to assess the educational value of the text. Each group was scored accordingly, and low-quality data with low scores was removed.

This dataset is for research and development only.

## Dataset Details

* **Owner(s):** NVIDIA
* **Creation Date:** Feb. 1, 2025
* **License/Terms of Use:** CC BY-NC 4.0
* **Intended Usage:** Pre-training language models.
* **Format:** Text in parquet format
* **Size:** 400 billion tokens
* **Data Collection Method:** Automated
* **Labeling Method:** Automated

## Usage

The ClimbLab dataset we released contains token sequences that have been tokenized using the GPT-2 tokenizer. If you wish to obtain the raw text, please use the provided script `detokenize_climblab.py`. For example:

```bash
python detokenize_climblab.py --input_folder <tokenized_folder> --output_folder <raw_text_folder>
```

We also noticed that some community members have converted and released a raw text version of ClimbLab on Hugging Face: https://huggingface.co/datasets/OptimalScale/ClimbLab. You may consider using this version to save the effort of manual conversion. However, please note that this is not the official release, and we are not responsible for the content or maintenance of community-hosted datasets.

## Training

We provide an example training script for pre-training a 1B model from scratch with nanoGPT. You may refer to the [ClimbMix](https://huggingface.co/datasets/nvidia/ClimbMix#training) repository for more details.

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Citation:

If you find our dataset helpful, please cite the following [paper](https://arxiv.org/abs/2504.13161):

```
@article{diao2025climb,
  author = {Shizhe Diao and Yu Yang and Yonggan Fu and Xin Dong and Dan Su and Markus Kliegl and Zijia Chen and Peter Belcak and Yoshi Suhara and Hongxu Yin and Mostofa Patwary and Celine Lin and Jan Kautz and Pavlo Molchanov},
  title = {CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training},
  journal = {arXiv preprint},
  year = {2025},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url = {https://arxiv.org/abs/2504.13161},
}
```
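
## Example: Group-Level Filtering Sketch

The Dataset Description says that topic groups were scored by an advertisement classifier and an educational-value classifier, and that low-scoring data was removed. The following is a minimal sketch of that idea, not the authors' implementation. The `Document` class, the score fields, the averaging rule, the choice to drop whole groups, and both thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Document:
    text: str
    ad_score: float   # hypothetical ad-classifier output; higher means more ad-like
    edu_score: float  # hypothetical educational-value output; higher means better


def score_group(docs):
    """Return (mean ad score, mean educational score) for one topic group."""
    return mean(d.ad_score for d in docs), mean(d.edu_score for d in docs)


def filter_groups(groups, max_ad=0.5, min_edu=0.5):
    """Keep groups whose average scores pass both thresholds.

    The thresholds and the group-level averaging are assumptions made
    for this sketch; the source does not specify them.
    """
    kept = {}
    for group_id, docs in groups.items():
        if not docs:
            continue  # an empty group has nothing to score
        ad, edu = score_group(docs)
        if ad <= max_ad and edu >= min_edu:
            kept[group_id] = docs
    return kept


groups = {
    "g0": [
        Document("Intro to linear algebra", ad_score=0.05, edu_score=0.9),
        Document("Proof of Bayes' theorem", ad_score=0.10, edu_score=0.8),
    ],
    "g1": [
        Document("Buy now, 50% off!", ad_score=0.95, edu_score=0.1),
        Document("Limited-time offer", ad_score=0.90, edu_score=0.2),
    ],
}

kept = filter_groups(groups)
print(sorted(kept))  # ['g0']
```

Here `g1` is dropped because its mean ad score exceeds `max_ad`, while `g0` passes both checks. A real pipeline would obtain the scores from trained classifiers and might filter at the document level instead.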

Provider: maas
Created: 2025-04-23