CCI4.0-M2-Base-v1

ModelScope Community · Updated 2026-05-23 · Indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/BAAI/CCI4.0-M2-Base-v1
Description:
# CCI4.0-M2 v1 Dataset Documentation

<a href="https://arxiv.org/abs/2506.07463"><b>Tech Report</b>👁</a>

## Overview

CCI4.0-M2 v1 is a comprehensive dataset collection consisting of two specialized subsets designed for language model training.

| | CCI4.0-M2-Base v1 | CCI4.0-M2-CoT v1 |
|--|--|--|
| Download Link | [BAAI_datahub](https://data.baai.ac.cn/datadetail/BAAI-CCI4.0-M2-Base-v1) / [modelscope](https://www.modelscope.cn/datasets/BAAI/CCI4.0-M2-Base-v1) / [hf](https://huggingface.co/datasets/BAAI/CCI4.0-M2-Base-v1) | [BAAI_datahub](https://data.baai.ac.cn/datadetail/BAAI-CCI4.0-M2-CoT-v1) / [modelscope](https://www.modelscope.cn/datasets/BAAI/CCI4.0-M2-CoT-v1) / [hf](https://huggingface.co/datasets/BAAI/CCI4.0-M2-CoT-v1) |
| Notes | 5.2TB of Chinese webpage data and 22TB of English webpage data; some data is released in CCI4.0-M2-Extra ([BAAI_datahub](https://data.baai.ac.cn/datadetail/BAAI-CCI4.0-M2-Extra-v1) / [modelscope](https://www.modelscope.cn/datasets/BAAI/CCI4.0-M2-Extra-v1) / [hf](https://huggingface.co/datasets/BAAI/CCI4.0-M2-Extra-v1)) due to license concerns. | 430 million CoT samples covering math, code, arXiv, wiki, and webpages |

The disk storage for the different subdomain datasets is shown in the table below:

| DataSets | Lines | Volume (G) |
|------------------|---------------|--------------|
| Web-EN | 7,175,101,435 | 22,498.641 |
| Web-ZH | 1,643,503,909 | 5,161.0895 |
| Code | 215,521,589.3 | 896.57865 |
| Math | 49,685,043.87 | 269.06715 |
| Books | 255,369,254.6 | 858.55473 |
| Wiki | 44,086,649.31 | 96.667659 |
| Arxiv | 1,536,117.66 | 87.12 |
| ForumQA | 28,664,137.6 | 78.00555 |
| pes2o | 4,354,668.22 | 117.14234 |
| CoT_synthesis | 392,470,068.8 | 4,121.978 |

## Subset Specifications

### CCI4.0-M2-Base v1

- **Purpose**: Core pretraining data for general language understanding
- **Data Composition**:
  - Chinese: 15% (including data from cooperation projects and open-source projects)
  - English: 85% (primarily sourced from Nemotron-CC and specific domains such as math, code, and books)
- **Total Volume**: 3000GB
- **Processing**:
  - Document-level and phrase-level deduplication
  - Rigorous quality filtering through the integration of three quality scores
  - Knowledge enhancement via LLM rewriting and generation
  - Filtering based on LLM loss, grouped by domain
  - PII and toxic content filtering
- **License**: Due to the [license concern](#license-details), we split CCI4.0-M2-Base v1 into 2 datasets.
  1. CCI4.0-M2-Base-v1
     - For open-source datasets, we selected those with an **Apache-2.0 license**.
     - For datasets contributed by various institutions, we conducted **additional license verification**.
     - **Nemotron-CC** is subject to the **Common Crawl License**, so we will only release its **metadata** along with our **processed scores**.
  2. CCI4.0-M2-Extra-v1
     - Data that is open-source but requires independent licensing, or that involves mixed/composite licenses, is placed in this "Extra" dataset.
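As an illustration of the first two processing steps listed for CCI4.0-M2-Base v1, the following Python sketch shows exact document-level deduplication followed by a threshold on a combined quality score. It is not the pipeline used to build the dataset: the whitespace/case normalization, the SHA-256 hashing, the record fields (`text`, `scores`), the plain mean over three scores, and the 0.5 threshold are all assumptions made for illustration. The documentation does not say how the three quality scores are integrated.

```python
import hashlib
from statistics import mean


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return " ".join(text.lower().split())


def dedup_documents(docs):
    """Keep the first occurrence of each document (exact match after normalization)."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def quality_filter(docs, threshold=0.5):
    """Keep documents whose mean quality score reaches the threshold.

    Averaging the scores is an assumed combination rule, not the dataset's.
    """
    return [doc for doc in docs if mean(doc["scores"]) >= threshold]


# Toy records; the field names and score values are hypothetical.
docs = [
    {"text": "Large language models need clean data.", "scores": [0.9, 0.8, 0.7]},
    {"text": "large language  models need clean data.", "scores": [0.9, 0.8, 0.7]},
    {"text": "click here click here click here", "scores": [0.2, 0.1, 0.3]},
]

kept = quality_filter(dedup_documents(docs))
print(len(kept), "document(s) kept")
```

At the scale reported in the storage table, an in-memory `set` of hashes would not be practical; a real pipeline would more likely shard by hash prefix or use approximate methods such as MinHash, which this sketch does not attempt.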
### CCI4.0-M2-CoT v1

- **Purpose**: Chain-of-Thought reasoning enhancement
- **Total Volume**: 4200GB
- **Special Features**:
  - Step-by-step CoT trajectories
  - Detailed question generation
  - Multiple domain coverage (e.g., math, code, webpages)
- **License**: CoT synthesis and reverse-thinking instruction synthesis were carried out on data from these sources. Due to license considerations, this data will be open-sourced in a separate directory.

#### Introduction and Demonstration of Synthesized Chain-of-Thought (CoT)

The Chain-of-Thought (CoT) data in the CCI4.0-M2-CoT v1 subset is synthesized to enhance the reasoning capabilities of language models. The synthesis process generates step-by-step reasoning trajectories from various data sources.

The following image illustrates the CoT synthesis pipeline:

<img src="CoT_Pipeline.png" alt="CoT_Pipeline" width="400"/>

The pipeline shows how raw data from different sources is processed and transformed into structured CoT data, which can be used to train language models to perform complex reasoning tasks.

## License Details

**Disclaimer: If any violations of the dataset usage agreement or licensing terms are identified, we kindly request that you notify us as soon as possible.**

We have collected the license agreements for the open-source datasets and confirmed each one individually. Below is a list of the main datasets and their corresponding licenses.

| Data Source | Open Source License |
| --- | --- |
| ChineseWebText2.0 | apache-2.0 |
| HPLT2.0_cleaned/zho_Hans | cc0-1.0 |
| TeleChat-PTD | apache-2.0 |
| data from cooperation projects | apache-2.0 |
| Nemotron-CC | Common Crawl License |
| CCI | apache-2.0 |
| --- | --- |
| MAP-CC | CC-BY-NC-ND-4.0 |
| fineweb-2 | ODC-BY |
| wanjuan/data/raw/nlp/CN | CC-BY-4.0 |
| starcoder | Multiple Licenses, see https://huggingface.co/datasets/bigcode/the-stack-dedup/blob/main/licenses.json |
| opc-annealing-corpus | Multiple agreements. Some corpora are from the-stack-v2. See agreements at: https://huggingface.co/datasets/bigcode/the-stack-v2/blob/main/license_stats.csv |
| smollm-corpus | Multiple agreements. Some corpora are from the-stack-v2. See agreements at: https://huggingface.co/datasets/bigcode/the-stack-v2/blob/main/license_stats.csv |
| dolma_pes2o_v2 | ODC-BY |
| pes2o | ODC-BY |
| dolma | ODC-BY |
| opc-fineweb-math-corpus | ODC-BY |
| proof-pile-2 | MIT, BSD, or Apache, ODC-By 1.0 license, etc. |
| --- | --- |
| KodCode/KodCode-V1 | cc-by-nc-4.0 |
| facebook/natural_reasoning | cc-by-nc-4.0 |
| allenai/dolma | odc-by |
| allenai/dolmino-mix-1124 | odc-by |
| HuggingFaceTB/finemath | odc-by |
| open-web-math/open-web-math | ODC-By 1.0 |

## Acknowledgments

We gratefully acknowledge the valuable contributions of the following institutions in providing the Chinese data: Alibaba Cloud (阿里云), Shanghai AI Laboratory (上海人工智能实验室), Huawei (华为), Mobvoi (出门问问), Kingsoft Office Software (金山办公), Kunlun (昆仑万维), ModelBest (面壁智能), Qihoo (奇虎科技), Meituan (美团), MiniMax (稀宇科技), Moonshot AI (月之暗面), Zidong Taichu (紫东太初), Wenge (中科闻歌), and iFLYTEK (科大讯飞).

## Usage Agreement

Users must comply with the usage agreement of the CCI dataset. The agreement is available here: [View Usage Agreement](https://data.baai.ac.cn/resources/agreement/cci_usage_aggrement.pdf).

## Citation

Please cite using:

```
@misc{liu2025cci40bilingualpretrainingdataset,
  title={CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models},
  author={Guang Liu and Liangdong Wang and Jijie Li and Yang Yu and Yao Xu and Jiabei Chen and Yu Bai and Feng Liao and Yonghua Lin},
  year={2025},
  eprint={2506.07463},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.07463},
}
```
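When mixing sources from the table in License Details, it can help to flag entries whose license identifier carries a non-commercial (NC) or no-derivatives (ND) clause. The sketch below copies a few rows from that table into a dictionary. The `restrictions` helper and its string matching on Creative Commons identifiers are hypothetical conveniences, not part of any CCI tooling, and they do not replace reading the license texts themselves.

```python
# A few rows copied from the license table; extend as needed.
LICENSES = {
    "ChineseWebText2.0": "apache-2.0",
    "HPLT2.0_cleaned/zho_Hans": "cc0-1.0",
    "MAP-CC": "CC-BY-NC-ND-4.0",
    "fineweb-2": "ODC-BY",
    "KodCode/KodCode-V1": "cc-by-nc-4.0",
    "facebook/natural_reasoning": "cc-by-nc-4.0",
}


def restrictions(license_id: str) -> set:
    """Return the NC/ND clauses present in a Creative Commons license identifier.

    Identifiers that do not start with the "cc" component (for example
    "apache-2.0", "odc-by", "cc0-1.0") yield an empty set.
    """
    parts = license_id.lower().split("-")
    if parts[0] != "cc":
        return set()
    return {clause for clause in ("nc", "nd") if clause in parts}


# Print only the sources that carry at least one restriction clause.
for source, license_id in LICENSES.items():
    flags = restrictions(license_id)
    if flags:
        print(f"{source}: {license_id} ({', '.join(sorted(flags))})")
```

Entries listed in the table as "Multiple Licenses" or "Multiple agreements" cannot be classified this way and need the per-file license lists linked in the table.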

Provider: maas
Created: 2025-05-07