Lxd99/PCMind-2.1-Kaiyuan-2B-phase1-part1-1-0323
Source: Hugging Face. Updated 2026-03-21; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Lxd99/PCMind-2.1-Kaiyuan-2B-phase1-part1-1-0323
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- zh
- en
tags:
- code
- math
- language
- sft
size_categories:
- n>1T
---
[License: Apache-2.0](LICENSE)
[arXiv:2512.07612](https://arxiv.org/abs/2512.07612)
This repository contains the complete pretraining dataset for
[PCMind-2.1-Kaiyuan-2B](https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B), a leading fully open-source language model.
### Overview
The dataset is organized into **5 training phases**, with all phase datasets open-sourced in this repository. Our training methodology employs domain-specific mixing strategies across five primary domains:
- English: General English text
- Chinese: General Chinese text
- Code: Programming and code-related content
- Math: Mathematical reasoning and problems
- SFT: Supervised fine-tuning data
The figure below shows the phase-wise mixing ratios across these five domains.
<center>
<img alt="Overall mixing ratio flow chart" style="width: 50%"
src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/PRvyzQs-dMLU7T04gUpez.png"/>
</center>
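As a rough illustration of what a domain mixing ratio means in practice, the sketch below draws domain labels with probability proportional to a ratio table. It is one common way to realize a mixture, not necessarily the procedure used for this dataset. The ratio values, the `PLACEHOLDER_RATIOS` name, and the `draw_domains` helper are all hypothetical; the actual per-phase ratios are those in the figure and the technical report.

```python
import random
from collections import Counter

# PLACEHOLDER values for illustration only. These are NOT the ratios
# used by PCMind-2.1-Kaiyuan-2B; see the technical report for those.
PLACEHOLDER_RATIOS = {
    "english": 0.4,
    "chinese": 0.2,
    "code": 0.2,
    "math": 0.1,
    "sft": 0.1,
}


def draw_domains(ratios, n, seed=0):
    """Draw n domain labels, each with probability proportional to its ratio."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    domains = list(ratios)
    weights = [ratios[d] for d in domains]
    return rng.choices(domains, weights=weights, k=n)


draws = draw_domains(PLACEHOLDER_RATIOS, n=10_000)
# Empirical counts should land close to ratio * n for each domain.
print(Counter(draws))
```

Changing the ratio table per phase is enough to model a phase-wise schedule: each phase would call `draw_domains` with its own table.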
### Phase Structure
The training process uses two distinct sampling strategies:
| Phase | Sampling Strategy | Data Format |
|-------|------------------|-------------|
| **Phase 1-2** | Uniform sampling | Single column: `text` |
| **Phase 3-5** | Curriculum learning | Two columns: `text` (content), `rank` (sample order) |
**Key distinctions:**
- **Phases 1-2**: Uniform data distribution with random sampling
- **Phases 3-5**: Curriculum-based learning with ordered sample progression using the `rank` field
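The two data formats above suggest two different ways of ordering samples before training. The following is a minimal sketch under stated assumptions, not the authors' pipeline: it assumes a smaller `rank` means the sample is presented earlier, and the helper names `uniform_order` and `curriculum_order` as well as the toy rows are hypothetical.

```python
import random


def uniform_order(samples, seed=0):
    """Phases 1-2 style: return the samples in a seeded random order."""
    shuffled = list(samples)  # copy so the input list is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled


def curriculum_order(samples):
    """Phases 3-5 style: return the samples sorted by their `rank` field."""
    # Assumption: a smaller rank means the sample comes earlier.
    return sorted(samples, key=lambda s: s["rank"])


# Toy rows mimicking the two documented schemas (contents are made up).
phase1_rows = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
phase3_rows = [
    {"text": "late", "rank": 2},
    {"text": "early", "rank": 0},
    {"text": "middle", "rank": 1},
]

ordered = [row["text"] for row in curriculum_order(phase3_rows)]
print(ordered)  # ['early', 'middle', 'late']
```

For data at the scale of this repository, an in-memory sort would not be practical; the point of the sketch is only that the `rank` column carries the ordering, while Phase 1-2 rows carry none.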
Each phase uses its own mixing ratios across the five domains. The exact composition and ratios are detailed in our [technical report](https://arxiv.org/abs/2512.07612).
<center>
<img alt="Phase 1" style="width: 45%; display: inline-block; margin: 1%;"
src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/Yh-sBtg4phJ-lcv8bOujG.png"/>
<img alt="Phase 2" style="width: 45%; display: inline-block; margin: 1%;"
src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/YvVKVg8HhF9cSLbt0X8nr.png"/>
<br>
<img alt="Phase 3" style="width: 31%; display: inline-block; margin: 1%;"
src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/0yj5UxLLACzfWhC3y_DVj.png"/>
<img alt="Phase 4" style="width: 31%; display: inline-block; margin: 1%;"
src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/ESTlPqVTm09C0KbAaT5VN.png"/>
<img alt="Phase 5" style="width: 31%; display: inline-block; margin: 1%;"
src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/nV49Iw0OS80Ia3opAN1kT.png"/>
</center>
## Reproducing the Dataset
To construct these phase datasets from scratch, refer to the [Kaiyuan-Spark](https://github.com/thu-pacman/Kaiyuan-Spark) repository, which provides comprehensive documentation on the preprocessing pipeline.
## Citation
If you use this dataset, please cite our technical report:
```bibtex
@misc{luo2025pcmind21kaiyuan2btechnicalreport,
title={PCMind-2.1-Kaiyuan-2B Technical Report},
author={Kairong Luo and Zhenbo Sun and Xinyu Shi and Shengqi Chen and Bowen Yu and Yunyi Chen and Chenyi Dang and Hengtao Tao and Hui Wang and Fangming Liu and Kaifeng Lyu and Wenguang Chen},
year={2025},
eprint={2512.07612},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.07612},
}
```
## Resources
- **Model**: [thu-pacman/PCMind-2.1-Kaiyuan-2B](https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B)
- **Preprocessing Pipeline**: [Kaiyuan-Spark](https://github.com/thu-pacman/Kaiyuan-Spark)
- **Technical Report**: [arXiv:2512.07612](https://arxiv.org/abs/2512.07612)
## License
All artifacts of Kaiyuan-2B (including code, model weights, and training data)
are licensed under the [Apache-2.0 License](LICENSE) with the following copyright notice:
```text
Copyright 2025 Tsinghua University & Peng Cheng Laboratory
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```
_NOTICE: This dataset constitutes a derivative work of multiple underlying raw datasets.
Users must comply with the applicable license terms of each source dataset._
Please refer to Section B of [our technical report](https://arxiv.org/abs/2512.07612) for details.
Provider:
Lxd99



