CCI4.0-M2-Extra-v1
Source: ModelScope (魔搭社区). Updated 2026-05-23; indexed 2025-06-14.
Download link:
https://modelscope.cn/datasets/BAAI/CCI4.0-M2-Extra-v1
Description:
# CCI4.0-M2 v1 Dataset Documentation
<a href="https://arxiv.org/abs/2506.07463"><b>Tech Report</b>👁</a>
## Overview
CCI4.0-M2 v1 is a comprehensive dataset collection consisting of two specialized subsets designed for language model training.
|| CCI4.0-M2-Base v1 | CCI4.0-M2-CoT v1 |
|--|--|--|
|Download Link| [BAAI_datahub](https://data.baai.ac.cn/datadetail/BAAI-CCI4.0-M2-Base-v1) / [modelscope](https://www.modelscope.cn/datasets/BAAI/CCI4.0-M2-Base-v1) / [hf](https://huggingface.co/datasets/BAAI/CCI4.0-M2-Base-v1) | [BAAI_datahub](https://data.baai.ac.cn/datadetail/BAAI-CCI4.0-M2-CoT-v1) / [modelscope](https://www.modelscope.cn/datasets/BAAI/CCI4.0-M2-CoT-v1) / [hf](https://huggingface.co/datasets/BAAI/CCI4.0-M2-CoT-v1) |
|Notes| 5.2TB of Chinese webpages and 22TB of English webpages; some data is released in CCI4.0-M2-Extra ([BAAI_datahub](https://data.baai.ac.cn/datadetail/BAAI-CCI4.0-M2-Extra-v1) / [modelscope](https://www.modelscope.cn/datasets/BAAI/CCI4.0-M2-Extra-v1) / [hf](https://huggingface.co/datasets/BAAI/CCI4.0-M2-Extra-v1)) due to license concerns. | 430 million CoT samples covering math, code, arXiv, wiki, and webpages |
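The download links in the table above follow one URL pattern per hub. The sketch below rebuilds them from a subset name. `hub_urls`, `HUB_PATTERNS`, and `SUBSETS` are hypothetical names introduced for illustration, and the patterns are only inferred from the links listed here, not from any official BAAI tooling.

```python
# Hypothetical helper: the URL patterns are inferred from the table's links.
HUB_PATTERNS = {
    "BAAI_datahub": "https://data.baai.ac.cn/datadetail/BAAI-{name}",
    "modelscope": "https://www.modelscope.cn/datasets/BAAI/{name}",
    "hf": "https://huggingface.co/datasets/BAAI/{name}",
}

# The three released subsets named in this document.
SUBSETS = ("CCI4.0-M2-Base-v1", "CCI4.0-M2-CoT-v1", "CCI4.0-M2-Extra-v1")


def hub_urls(name: str) -> dict[str, str]:
    """Return the landing-page URL on each hub for one released subset."""
    if name not in SUBSETS:
        raise ValueError(f"unknown subset: {name!r}")
    return {hub: pattern.format(name=name) for hub, pattern in HUB_PATTERNS.items()}


if __name__ == "__main__":
    for subset in SUBSETS:
        for hub, url in hub_urls(subset).items():
            print(f"{subset:20s} {hub:13s} {url}")
```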
The disk storage for different subdomain datasets is shown in the table below:
| Dataset          | Lines         | Volume (GB)  |
|------------------|---------------|--------------|
| Web-EN | 7,175,101,435 | 22,498.641 |
| Web-ZH | 1,643,503,909 | 5,161.0895 |
| Code | 215,521,589.3 | 896.57865 |
| Math | 49,685,043.87 | 269.06715 |
| Books | 255,369,254.6 | 858.55473 |
| Wiki | 44,086,649.31 | 96.667659 |
| Arxiv | 1,536,117.66 | 87.12 |
| ForumQA | 28,664,137.6 | 78.00555 |
| pes2o | 4,354,668.22 | 117.14234 |
| CoT_synthesis | 392,470,068.8 | 4,121.978 |
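To put the table in perspective, the following sketch sums the volumes and derives each subset's share and average record size. The figures are copied from the table. The function names are hypothetical, and the per-line size assumes the volume column is in decimal gigabytes (1 GB = 10^9 bytes), which the source does not state.

```python
# (lines, volume in GB) per subset, copied from the storage table above.
STORAGE = {
    "Web-EN": (7_175_101_435, 22_498.641),
    "Web-ZH": (1_643_503_909, 5_161.0895),
    "Code": (215_521_589.3, 896.57865),
    "Math": (49_685_043.87, 269.06715),
    "Books": (255_369_254.6, 858.55473),
    "Wiki": (44_086_649.31, 96.667659),
    "Arxiv": (1_536_117.66, 87.12),
    "ForumQA": (28_664_137.6, 78.00555),
    "pes2o": (4_354_668.22, 117.14234),
    "CoT_synthesis": (392_470_068.8, 4_121.978),
}


def total_volume_gb(table):
    """Sum the volume column over all subsets."""
    return sum(volume for _, volume in table.values())


def volume_share(table):
    """Fraction of the total volume contributed by each subset."""
    total = total_volume_gb(table)
    return {name: volume / total for name, (_, volume) in table.items()}


def avg_kb_per_line(table):
    """Average record size in KB, ASSUMING 1 GB = 10**9 bytes."""
    return {name: volume * 1e6 / lines for name, (lines, volume) in table.items()}


if __name__ == "__main__":
    shares = volume_share(STORAGE)
    sizes = avg_kb_per_line(STORAGE)
    print(f"total: {total_volume_gb(STORAGE):,.1f} GB")
    for name in STORAGE:
        print(f"{name:14s} share={shares[name]:6.2%}  avg={sizes[name]:8.2f} KB/line")
```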
## Subset Specifications
### CCI4.0-M2-Base v1
- **Purpose**: Core pretraining data for general language understanding
- **Data Composition**:
  - Chinese: 15% (including data from cooperation projects and open-source projects)
  - English: 85% (primarily sourced from Nemotron-CC and specific domains such as math, code, and books)
- **Total Volume**: 3000GB
- **Processing**:
  - Document-level and phrase-level deduplication
  - Rigorous quality filtering through the integration of three quality scores
  - Knowledge enhancement via LLM rewriting and generation
  - Filtering based on LLM loss, grouped by domain
  - PII and toxic-content filtering
- **License**:
  Due to [license concerns](#license-details), we split CCI4.0-M2-Base v1 into two datasets.
  1. CCI4.0-M2-Base-v1
     - For open-source datasets, we selected those with an **Apache-2.0 license**.
     - For datasets contributed by various institutions, we conducted **additional license verification**.
     - **Nemotron-CC** is subject to the **Common Crawl License**, so we will only release its **metadata** along with our **processed scores**.
  2. CCI4.0-M2-Extra-v1
     - Data that is open-source but requires independent licensing, or involves mixed/composite licenses, is placed in this "Extra" dataset.
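The processing steps listed above for CCI4.0-M2-Base v1 are not specified in detail here. As a rough illustration of the first two, the sketch below performs document-level deduplication and a filter over three quality scores. It is not the authors' pipeline: exact hashing stands in for whatever deduplication method was used, the arithmetic mean stands in for the unspecified score integration, and `Doc`, `fingerprint`, `dedup_documents`, `quality_filter`, and the 0.5 threshold are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from statistics import mean


@dataclass
class Doc:
    text: str
    scores: tuple[float, float, float]  # three hypothetical quality scores in [0, 1]


def fingerprint(text: str) -> str:
    """Hash of lower-cased, whitespace-normalised text (exact-duplicate key)."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def dedup_documents(docs):
    """Keep the first document seen for each fingerprint."""
    seen = set()
    kept = []
    for doc in docs:
        key = fingerprint(doc.text)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def quality_filter(docs, threshold=0.5):
    """Keep documents whose mean score reaches the (assumed) threshold."""
    return [doc for doc in docs if mean(doc.scores) >= threshold]


if __name__ == "__main__":
    corpus = [
        Doc("Hello   world", (0.9, 0.8, 0.7)),
        Doc("hello world", (0.9, 0.8, 0.7)),  # duplicate after normalisation
        Doc("buy now!!!", (0.1, 0.2, 0.3)),   # low quality
    ]
    for doc in quality_filter(dedup_documents(corpus)):
        print(repr(doc.text))
```

Phrase-level deduplication, LLM-loss filtering, and PII filtering are left out of the sketch because the document gives no detail to base them on.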
### CCI4.0-M2-CoT v1
- **Purpose**: Chain-of-Thought reasoning enhancement
- **Total Volume**: 4200GB
- **Special Features**:
  - Step-by-step CoT trajectories
  - Detailed question generation
  - Multiple domain coverage (e.g., math, code, webpages)
- **License**:
  CoT synthesis and reverse-thinking instruction synthesis were carried out on data from these sources. Due to license considerations, this data will be open-sourced in a separate directory.
#### Introduction and Demonstration of Synthesized Chain-of-Thought (CoT)
The Chain-of-Thought (CoT) data in the CCI4.0-M2-CoT v1 subset is synthesized to enhance the reasoning capabilities of language models. The synthesis process generates step-by-step reasoning trajectories from various data sources.
The following image illustrates the CoT synthesis pipeline:
<img src="CoT_Pipeline.png" alt="CoT_Pipeline" width="400"/>
This pipeline showcases how the raw data from different sources is processed and transformed into structured CoT data, which can be used for training language models to perform complex reasoning tasks.
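To make "structured CoT data" concrete, here is a minimal sketch of what one synthesized sample might look like and how it could be flattened into training text. The released schema is not described in this document, so every field name, the `CoTSample` class, and the `render` format are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CoTSample:
    # All field names are hypothetical; the released schema may differ.
    domain: str                                     # e.g. "math", "code", "webpage"
    question: str                                   # question generated from the source text
    steps: list[str] = field(default_factory=list)  # step-by-step reasoning trajectory
    answer: str = ""                                # final answer


def render(sample: CoTSample) -> str:
    """Flatten one sample into a single training string (assumed format)."""
    lines = [f"Question: {sample.question}"]
    lines += [f"Step {i}: {step}" for i, step in enumerate(sample.steps, start=1)]
    lines.append(f"Answer: {sample.answer}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = CoTSample(
        domain="math",
        question="What is 2 + 3?",
        steps=["Start from 2.", "Add 3 to reach 5."],
        answer="5",
    )
    print(render(sample))
```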
## License Details
**Disclaimer: If any violations of the dataset usage agreement or licensing terms are identified, we kindly request that you notify us as soon as possible.**
We have organized the agreements for the open-source datasets and confirmed them individually. Below is a list of the main datasets and their corresponding licenses.
| Data Source | Open Source License |
| --- | --- |
| ChineseWebText2.0 | apache-2.0 |
| HPLT2.0_cleaned/zho_Hans | cc0-1.0 |
| TeleChat-PTD | apache-2.0 |
| data from cooperation projects | apache-2.0 |
| Nemotron-CC | Common Crawl License |
| CCI | apache-2.0 |
| --- | --- |
| MAP-CC | CC-BY-NC-ND-4.0 |
| fineweb-2 | ODC-BY |
| wanjuan/data/raw/nlp/CN | CC-BY-4.0 |
| starcoder | Multiple Licenses, see https://huggingface.co/datasets/bigcode/the-stack-dedup/blob/main/licenses.json |
| opc-annealing-corpus | Multiple agreements. Some corpora are from the-stack-v2. See agreements at: https://huggingface.co/datasets/bigcode/the-stack-v2/blob/main/license_stats.csv |
| smollm-corpus | Multiple agreements. Some corpora are from the-stack-v2. See agreements at: https://huggingface.co/datasets/bigcode/the-stack-v2/blob/main/license_stats.csv |
| dolma_pes2o_v2 | ODC-BY |
| pes2o | ODC-BY |
| dolma | ODC-BY |
| opc-fineweb-math-corpus | ODC-BY |
| proof-pile-2 | MIT, BSD, or Apache, ODC-By 1.0 license, etc. |
| --- | --- |
| KodCode/KodCode-V1 | cc-by-nc-4.0 |
| facebook/natural_reasoning | cc-by-nc-4.0 |
| allenai/dolma | odc-by |
| allenai/dolmino-mix-1124 | odc-by |
| HuggingFaceTB/finemath | odc-by |
| open-web-math/open-web-math | ODC-By 1.0 |
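Because the sources carry different terms, it can help to check licenses programmatically before mixing subsets. The sketch below copies the table into a dictionary and flags sources whose license contains a non-commercial (`nc`) clause. `LICENSES`, `is_non_commercial`, and `non_commercial_sources` are hypothetical names, the multi-license rows are abbreviated to `"multiple"`, and the check is a simple string heuristic, not legal advice.

```python
# Source -> license, copied from the table above (multi-license rows abbreviated).
LICENSES = {
    "ChineseWebText2.0": "apache-2.0",
    "HPLT2.0_cleaned/zho_Hans": "cc0-1.0",
    "TeleChat-PTD": "apache-2.0",
    "data from cooperation projects": "apache-2.0",
    "Nemotron-CC": "Common Crawl License",
    "CCI": "apache-2.0",
    "MAP-CC": "CC-BY-NC-ND-4.0",
    "fineweb-2": "ODC-BY",
    "wanjuan/data/raw/nlp/CN": "CC-BY-4.0",
    "starcoder": "multiple",
    "opc-annealing-corpus": "multiple",
    "smollm-corpus": "multiple",
    "dolma_pes2o_v2": "ODC-BY",
    "pes2o": "ODC-BY",
    "dolma": "ODC-BY",
    "opc-fineweb-math-corpus": "ODC-BY",
    "proof-pile-2": "multiple",
    "KodCode/KodCode-V1": "cc-by-nc-4.0",
    "facebook/natural_reasoning": "cc-by-nc-4.0",
    "allenai/dolma": "odc-by",
    "allenai/dolmino-mix-1124": "odc-by",
    "HuggingFaceTB/finemath": "odc-by",
    "open-web-math/open-web-math": "ODC-By 1.0",
}


def is_non_commercial(license_name: str) -> bool:
    """True if the license id has an 'nc' component (e.g. cc-by-nc-4.0)."""
    return "nc" in license_name.lower().split("-")


def non_commercial_sources(licenses=LICENSES):
    """Sorted list of sources whose license restricts commercial use."""
    return sorted(name for name, lic in licenses.items() if is_non_commercial(lic))


if __name__ == "__main__":
    for name in non_commercial_sources():
        print(f"{name}: {LICENSES[name]}")
```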
## Acknowledgments
We gratefully acknowledge the valuable contributions of the following institutions in providing the Chinese data: Alibaba Cloud (阿里云), Shanghai AI Laboratory (上海人工智能实验室), Huawei (华为), Mobvoi (出门问问), Kingsoft Office Software (金山办公), Kunlun (昆仑万维), ModelBest (面壁智能), Qihoo (奇虎科技), Meituan (美团), MiniMax (稀宇科技), Moonshot AI (月之暗面), Zidong Taichu (紫东太初), Wenge (中科闻歌), and iFLYTEK (科大讯飞).
## Usage Agreement
Users need to comply with the usage agreement of the CCI dataset. You can view the agreement by clicking on the following link: ([View Usage Agreement](https://data.baai.ac.cn/resources/agreement/cci_usage_aggrement.pdf)).
## Citation
Please cite using:
```
@misc{liu2025cci40bilingualpretrainingdataset,
title={CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models},
author={Guang Liu and Liangdong Wang and Jijie Li and Yang Yu and Yao Xu and Jiabei Chen and Yu Bai and Feng Liao and Yonghua Lin},
year={2025},
eprint={2506.07463},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.07463},
}
```
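Provider:
maas
Created:
2025-05-07
Background overview
CCI4.0-M2-Extra-v1 is a subset of the CCI4.0-M2 v1 dataset that holds data which could not be fully released in the base dataset because of licensing issues. It covers web pages, code, math, books, and other domains, totals about 1.97TB, is released under the Apache License 2.0, and has gone through a rigorous processing pipeline that includes deduplication, quality filtering, and knowledge enhancement.
The summary above was collected and generated by 遇见数据集.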



