PCMind-2.1-Kaiyuan-2B
ModelScope Community · Updated 2026-01-06 · Indexed 2025-12-20
Download link:
https://modelscope.cn/datasets/thu-pacman/PCMind-2.1-Kaiyuan-2B
Description:
This repository contains the complete pretraining dataset for
[PCMind-2.1-Kaiyuan-2B](https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B), a leading fully open-source language model.
### Overview
The dataset is organized into **5 training phases**, with all phase datasets open-sourced in this repository. Our training methodology employs domain-specific mixing strategies across five primary domains:
- English: General English text
- Chinese: General Chinese text
- Code: Programming and code-related content
- Math: Mathematical reasoning and problems
- SFT: Supervised fine-tuning data
The figure below shows how the mixing ratios across these five domains change from phase to phase.
<center>
<img alt="Overall mixing ratio flow chart" style="width: 50%"
src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/PRvyzQs-dMLU7T04gUpez.png"/>
</center>
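To make the idea of a per-phase mixing ratio concrete, here is a minimal sketch of ratio-weighted domain sampling. The ratio values, the `EXAMPLE_RATIOS` name, and the `sample_domains` helper are hypothetical placeholders for illustration; the actual ratios are given in the technical report, and this is not the authors' pipeline.

```python
import random
from collections import Counter

# HYPOTHETICAL ratios for a single phase, for illustration only.
# The real per-phase values are documented in the technical report.
EXAMPLE_RATIOS = {
    "english": 0.50,
    "chinese": 0.20,
    "code": 0.15,
    "math": 0.10,
    "sft": 0.05,
}


def sample_domains(ratios, n, seed=0):
    """Draw n domain labels with probability proportional to each ratio."""
    total = sum(ratios.values())
    if total <= 0:
        raise ValueError("mixing ratios must sum to a positive value")
    domains = list(ratios)
    # Normalize so the ratios need not sum to exactly 1.
    weights = [ratios[d] / total for d in domains]
    rng = random.Random(seed)  # seeded for reproducibility
    return rng.choices(domains, weights=weights, k=n)


if __name__ == "__main__":
    draws = sample_domains(EXAMPLE_RATIOS, 10_000)
    print(Counter(draws))
```

With enough draws, the empirical share of each domain approaches its configured ratio, which is the property a phase-level mix relies on.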
### Phase Structure
The training process uses two distinct sampling strategies:
| Phase | Sampling Strategy | Data Format |
|-------|------------------|-------------|
| **Phase 1-2** | Uniform sampling | Single column: `text` |
| **Phase 3-5** | Curriculum learning | Two columns: `text` (content), `rank` (sample order) |
**Key distinctions:**
- **Phases 1-2**: Uniform data distribution with random sampling
- **Phases 3-5**: Curriculum-based learning with ordered sample progression using the `rank` field
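The two data formats suggest different consumption orders. The sketch below shows one way to handle both, assuming rows are plain dictionaries and that a lower `rank` means the sample is seen earlier; that ordering direction and the `order_samples` helper are assumptions, not something the dataset card states.

```python
import random

UNIFORM_PHASES = {1, 2}        # single column: text
CURRICULUM_PHASES = {3, 4, 5}  # two columns: text, rank


def order_samples(rows, phase, seed=0):
    """Return the texts of `rows` in the order a trainer might consume them.

    Phases 1-2: random shuffle (uniform sampling).
    Phases 3-5: sort by `rank` (ASSUMED ascending: lower rank first).
    """
    if phase in CURRICULUM_PHASES:
        if any("rank" not in row for row in rows):
            raise ValueError("curriculum phases require a 'rank' column")
        ordered = sorted(rows, key=lambda row: row["rank"])
    elif phase in UNIFORM_PHASES:
        ordered = list(rows)  # copy so the caller's list is not mutated
        random.Random(seed).shuffle(ordered)
    else:
        raise ValueError(f"unknown phase: {phase}")
    return [row["text"] for row in ordered]


if __name__ == "__main__":
    curriculum_rows = [
        {"text": "hard sample", "rank": 2},
        {"text": "easy sample", "rank": 0},
        {"text": "medium sample", "rank": 1},
    ]
    print(order_samples(curriculum_rows, phase=3))
```

For data that does not fit in memory, the same logic would need an external sort or pre-sorted shards; this in-memory version only illustrates the role of the `rank` field.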
Each phase employs strategically designed mixing ratios across the five domains. The specific composition and ratios are detailed in our [technical report](https://arxiv.org/abs/2512.07612).
<center>
<img alt="Phase 1" style="width: 45%; display: inline-block; margin: 1%;"
src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/Yh-sBtg4phJ-lcv8bOujG.png"/>
<img alt="Phase 2" style="width: 45%; display: inline-block; margin: 1%;"
src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/YvVKVg8HhF9cSLbt0X8nr.png"/>
<br>
<img alt="Phase 3" style="width: 31%; display: inline-block; margin: 1%;"
src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/0yj5UxLLACzfWhC3y_DVj.png"/>
<img alt="Phase 4" style="width: 31%; display: inline-block; margin: 1%;"
src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/ESTlPqVTm09C0KbAaT5VN.png"/>
<img alt="Phase 5" style="width: 31%; display: inline-block; margin: 1%;"
src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/nV49Iw0OS80Ia3opAN1kT.png"/>
</center>
## Reproducing the Dataset
To construct these phase datasets from scratch, refer to the [Kaiyuan-Spark](https://github.com/thu-pacman/Kaiyuan-Spark) repository, which provides comprehensive documentation on the preprocessing pipeline.
## Citation
If you use this dataset, please cite our technical report:
```bibtex
@misc{luo2025pcmind21kaiyuan2btechnicalreport,
title={PCMind-2.1-Kaiyuan-2B Technical Report},
author={Kairong Luo and Zhenbo Sun and Xinyu Shi and Shengqi Chen and Bowen Yu and Yunyi Chen and Chenyi Dang and Hengtao Tao and Hui Wang and Fangming Liu and Kaifeng Lyu and Wenguang Chen},
year={2025},
eprint={2512.07612},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.07612},
}
```
## Resources
- **Model**: [thu-pacman/PCMind-2.1-Kaiyuan-2B](https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B)
- **Preprocessing Pipeline**: [Kaiyuan-Spark](https://github.com/thu-pacman/Kaiyuan-Spark)
- **Technical Report**: [arXiv:2512.07612](https://arxiv.org/abs/2512.07612)
## License
All artifacts of Kaiyuan-2B (including code, model weights, and training data)
are licensed under the [Apache-2.0 License](LICENSE) with the following copyright notice:
```text
Copyright 2025 Tsinghua University & Peng Cheng Laboratory
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```
_NOTICE: This dataset constitutes a derivative work of multiple underlying raw datasets.
Users must comply with the applicable license terms of each source dataset._
Please refer to Section B of [our technical report](https://arxiv.org/abs/2512.07612) for details.
Provider:
maas
Created:
2025-12-09



