
CCI4.0-M2-CoT-v1

ModelScope community · Updated 2026-05-23 · Indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/BAAI/CCI4.0-M2-CoT-v1
Description:
# CCI4.0-M2 v1 Dataset Documentation

<a href="https://arxiv.org/abs/2506.07463"><b>Tech Report</b>👁</a>

## Overview

CCI4.0-M2 v1 is a comprehensive dataset collection consisting of two specialized subsets designed for language model training.

| | CCI4.0-M2-Base v1 | CCI4.0-M2-CoT v1 |
|--|--|--|
| Download Link | [BAAI_datahub](https://data.baai.ac.cn/datadetail/BAAI-CCI4.0-M2-Base-v1) / [modelscope](https://www.modelscope.cn/datasets/BAAI/CCI4.0-M2-Base-v1) / [hf](https://huggingface.co/datasets/BAAI/CCI4.0-M2-Base-v1) | [BAAI_datahub](https://data.baai.ac.cn/datadetail/BAAI-CCI4.0-M2-CoT-v1) / [modelscope](https://www.modelscope.cn/datasets/BAAI/CCI4.0-M2-CoT-v1) / [hf](https://huggingface.co/datasets/BAAI/CCI4.0-M2-CoT-v1) |
| Notes | 5.2 TB of Chinese webpages and 22 TB of English webpages; some data is released in CCI4.0-M2-Extra ([BAAI_datahub](https://data.baai.ac.cn/datadetail/BAAI-CCI4.0-M2-Extra-v1) / [modelscope](https://www.modelscope.cn/datasets/BAAI/CCI4.0-M2-Extra-v1) / [hf](https://huggingface.co/datasets/BAAI/CCI4.0-M2-Extra-v1)) due to license concerns. | 430 million CoT samples covering math, code, arXiv, wiki, and webpages |

The disk storage for the different subdomain datasets is shown in the table below:

| Dataset | Lines | Volume (GB) |
|------------------|---------------|--------------|
| Web-EN | 7,175,101,435 | 22,498.641 |
| Web-ZH | 1,643,503,909 | 5,161.0895 |
| Code | 215,521,589.3 | 896.57865 |
| Math | 49,685,043.87 | 269.06715 |
| Books | 255,369,254.6 | 858.55473 |
| Wiki | 44,086,649.31 | 96.667659 |
| Arxiv | 1,536,117.66 | 87.12 |
| ForumQA | 28,664,137.6 | 78.00555 |
| pes2o | 4,354,668.22 | 117.14234 |
| CoT_synthesis | 392,470,068.8 | 4,121.978 |

## Subset Specifications

### CCI4.0-M2-Base v1

- **Purpose**: Core pretraining data for general language understanding
- **Data Composition**:
  - Chinese: 15% (including data from cooperation projects and open-source projects)
  - English: 85% (primarily sourced from Nemotron-CC and specific domains such as math, code, and books)
- **Total Volume**: 3000 GB
- **Processing**:
  - Document-level and phrase-level deduplication
  - Rigorous quality filtering through the integration of three quality scores
  - Knowledge enhancement via LLM rewriting and generation
  - Filtering based on LLM loss, grouped by domain
  - PII and toxic-content filtering
- **License**: Due to the [license concern](#license-details), we split CCI4.0-M2-Base v1 into two datasets.
  1. CCI4.0-M2-Base-v1
     - For open-source datasets, we selected those with an **Apache-2.0 license**.
     - For datasets contributed by various institutions, we conducted **additional license verification**.
     - **Nemotron-CC** is subject to the **Common Crawl License**, so we will only release its **metadata** along with our **processed scores**.
  2. CCI4.0-M2-Extra-v1
     - Data that is open-source but requires independent licensing, or that involves mixed or composite licenses, is placed in this "Extra" dataset.
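The deduplication and quality-filtering steps listed under Processing for CCI4.0-M2-Base v1 can be illustrated with a small sketch. This is not the pipeline used to build the dataset: the hash-based exact matching, the `scores` field, the averaging rule, and the `0.5` threshold are all assumptions chosen for illustration, since this page does not specify how the three quality scores are integrated.

```python
import hashlib
from statistics import mean


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return " ".join(text.lower().split())


def dedup_documents(docs):
    """Document-level dedup: keep the first document for each normalized content hash."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def passes_quality(doc, threshold=0.5):
    """Hypothetical integration rule: average the three scores, compare to a threshold."""
    return mean(doc["scores"]) >= threshold


# Toy records; the "scores" field stands in for three per-document quality scores.
docs = [
    {"text": "Prime numbers have exactly two divisors.", "scores": [0.9, 0.8, 0.7]},
    {"text": "prime  numbers have exactly two divisors.", "scores": [0.9, 0.8, 0.7]},
    {"text": "click here click here click here", "scores": [0.2, 0.1, 0.3]},
]

unique_docs = dedup_documents(docs)
filtered = [d for d in unique_docs if passes_quality(d)]
print(len(docs), len(unique_docs), len(filtered))
```

Exact hashing only catches copies that are identical after normalization; near-duplicate and phrase-level deduplication would need a fuzzier technique such as shingling with MinHash, which this sketch does not attempt.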
### CCI4.0-M2-CoT v1

- **Purpose**: Chain-of-Thought reasoning enhancement
- **Total Volume**: 4200 GB
- **Special Features**:
  - Step-by-step CoT trajectories
  - Detailed question generation
  - Multiple domain coverage (e.g., math, code, webpages)
- **License**: CoT synthesis and reverse-thinking instruction synthesis were carried out on data from these sources. Due to license considerations, these data will be open-sourced in a separate directory.

#### Introduction and Demonstration of Synthesized Chain-of-Thought (CoT)

The Chain-of-Thought (CoT) data in the CCI4.0-M2-CoT v1 subset is synthesized to enhance the reasoning capabilities of language models. The synthesis process generates step-by-step reasoning trajectories from various data sources.

The following image illustrates the CoT synthesis pipeline:

<img src="CoT_Pipeline.png" alt="CoT_Pipeline" width="400"/>

The pipeline shows how raw data from different sources is processed and transformed into structured CoT data, which can be used to train language models on complex reasoning tasks.

## License Details

**Disclaimer: If any violations of the dataset usage agreement or licensing terms are identified, we kindly request that you notify us as soon as possible.**

We have collected the license agreements for the open-source datasets and confirmed each one individually. Below is a list of the main datasets and their corresponding licenses.

| Data Source | Open Source License |
| --- | --- |
| ChineseWebText2.0 | apache-2.0 |
| HPLT2.0_cleaned/zho_Hans | cc0-1.0 |
| TeleChat-PTD | apache-2.0 |
| data from cooperation projects | apache-2.0 |
| Nemotron-CC | Common Crawl License |
| CCI | apache-2.0 |
| --- | --- |
| MAP-CC | CC-BY-NC-ND-4.0 |
| fineweb-2 | ODC-BY |
| wanjuan/data/raw/nlp/CN | CC-BY-4.0 |
| starcoder | Multiple licenses; see https://huggingface.co/datasets/bigcode/the-stack-dedup/blob/main/licenses.json |
| opc-annealing-corpus | Multiple agreements. Some corpora are from the-stack-v2. See agreements at: https://huggingface.co/datasets/bigcode/the-stack-v2/blob/main/license_stats.csv |
| smollm-corpus | Multiple agreements. Some corpora are from the-stack-v2. See agreements at: https://huggingface.co/datasets/bigcode/the-stack-v2/blob/main/license_stats.csv |
| dolma_pes2o_v2 | ODC-BY |
| pes2o | ODC-BY |
| dolma | ODC-BY |
| opc-fineweb-math-corpus | ODC-BY |
| proof-pile-2 | MIT, BSD, or Apache, ODC-By 1.0 license, etc. |
| --- | --- |
| KodCode/KodCode-V1 | cc-by-nc-4.0 |
| facebook/natural_reasoning | cc-by-nc-4.0 |
| allenai/dolma | odc-by |
| allenai/dolmino-mix-1124 | odc-by |
| HuggingFaceTB/finemath | odc-by |
| open-web-math/open-web-math | ODC-By 1.0 |

## Acknowledgments

We gratefully acknowledge the valuable contributions of the following institutions in providing the Chinese data: Alibaba Cloud (阿里云), Shanghai AI Laboratory (上海人工智能实验室), Huawei (华为), Mobvoi (出门问问), Kingsoft Office Software (金山办公), Kunlun (昆仑万维), ModelBest (面壁智能), Qihoo (奇虎科技), Meituan (美团), MiniMax (稀宇科技), Moonshot AI (月之暗面), Zidong Taichu (紫东太初), Wenge (中科闻歌), and iFLYTEK (科大讯飞).

## Usage Agreement

Users must comply with the usage agreement of the CCI dataset, which is available at the following link: [View Usage Agreement](https://data.baai.ac.cn/resources/agreement/cci_usage_aggrement.pdf).

## Citation

Please cite using:

```
@misc{liu2025cci40bilingualpretrainingdataset,
  title={CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models},
  author={Guang Liu and Liangdong Wang and Jijie Li and Yang Yu and Yao Xu and Jiabei Chen and Yu Bai and Feng Liao and Yonghua Lin},
  year={2025},
  eprint={2506.07463},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.07463},
}
```

Provider:
maas
Created:
2025-05-07