
Lxd99/PCMind-2.1-Kaiyuan-2B-phase1-part1-1-0323

Source: Hugging Face | Updated: 2026-03-21 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/Lxd99/PCMind-2.1-Kaiyuan-2B-phase1-part1-1-0323
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- zh
- en
tags:
- code
- math
- language
- sft
size_categories:
- n>1T
---

[![License](https://img.shields.io/badge/License-Apache-f5de53?&color=f5de53)](LICENSE) [![arXiv-2512.07612](https://img.shields.io/badge/arXiv-2512.07612-b31b1b.svg?style=flat)](https://arxiv.org/abs/2512.07612)

This repository contains the complete pretraining dataset for [PCMind-2.1-Kaiyuan-2B](https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B), a leading fully open-source language model.

### Overview

The dataset is organized into **5 training phases**, and every phase dataset is open-sourced in this repository. Training uses domain-specific mixing strategies across five primary domains:

- English: General English text
- Chinese: General Chinese text
- Code: Programming and code-related content
- Math: Mathematical reasoning and problems
- SFT: Supervised fine-tuning data

The chart below shows the phase-wise mixing ratios across these five domains.

<center>
<img alt="Overall mixing ratio flow chart" style="width: 50%" src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/PRvyzQs-dMLU7T04gUpez.png"/>
</center>

### Phase Structure

The training process uses two distinct sampling strategies:

| Phase | Sampling Strategy | Data Format |
|-------|------------------|-------------|
| **Phase 1-2** | Uniform sampling | Single column: `text` |
| **Phase 3-5** | Curriculum learning | Two columns: `text` (content), `rank` (sample order) |

**Key distinctions:**

- **Phases 1-2**: Uniform data distribution with random sampling
- **Phases 3-5**: Curriculum-based learning with ordered sample progression using the `rank` field

Each phase uses its own mixing ratios across the five domains. The specific composition and ratios are detailed in our [technical report](https://arxiv.org/abs/2512.07612).
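The two data formats in the Phase Structure table could be consumed roughly as in the sketch below. This is a minimal illustration, not the authors' training loader: the function name `iter_phase` and the toy rows are hypothetical, a seeded shuffle is used as a stand-in for uniform sampling, and it assumes that a lower `rank` value means an earlier sample, which the dataset card does not state explicitly.

```python
import random
from typing import Dict, Iterator, List


def iter_phase(rows: List[Dict], phase: int, seed: int = 0) -> Iterator[Dict]:
    """Yield rows in a plausible training order for a phase (1-5).

    Hypothetical helper: phases 1-2 are shuffled (uniform sampling),
    phases 3-5 are ordered by the `rank` column (curriculum learning).
    """
    if phase not in (1, 2, 3, 4, 5):
        raise ValueError(f"phase must be 1-5, got {phase}")
    if phase <= 2:
        # Rows carry only `text`; order is random but reproducible via the seed.
        shuffled = list(rows)
        random.Random(seed).shuffle(shuffled)
        yield from shuffled
    else:
        # Rows carry `text` and `rank`; ascending rank is an assumption here.
        yield from sorted(rows, key=lambda row: row["rank"])


if __name__ == "__main__":
    # Toy stand-ins for dataset rows, not real samples.
    curriculum_rows = [
        {"text": "c", "rank": 2},
        {"text": "a", "rank": 0},
        {"text": "b", "rank": 1},
    ]
    print([row["text"] for row in iter_phase(curriculum_rows, phase=3)])
    # prints ['a', 'b', 'c']
```

For a dataset of this size, a real pipeline would stream shards rather than sort in memory; the sketch only shows how the `rank` column distinguishes the two phase groups.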
<center>
<img alt="Phase 1" style="width: 45%; display: inline-block; margin: 1%;" src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/Yh-sBtg4phJ-lcv8bOujG.png"/>
<img alt="Phase 2" style="width: 45%; display: inline-block; margin: 1%;" src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/YvVKVg8HhF9cSLbt0X8nr.png"/>
<br>
<img alt="Phase 3" style="width: 31%; display: inline-block; margin: 1%;" src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/0yj5UxLLACzfWhC3y_DVj.png"/>
<img alt="Phase 4" style="width: 31%; display: inline-block; margin: 1%;" src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/ESTlPqVTm09C0KbAaT5VN.png"/>
<img alt="Phase 5" style="width: 31%; display: inline-block; margin: 1%;" src="https://cdn-uploads.huggingface.co/production/uploads/64094eb49e9f790c905a3a59/nV49Iw0OS80Ia3opAN1kT.png"/>
</center>

## Reproducing the Dataset

To construct these phase datasets from scratch, refer to the [Kaiyuan-Spark](https://github.com/thu-pacman/Kaiyuan-Spark) repository, which provides comprehensive documentation on the preprocessing pipeline.
## Citation

If you use this dataset, please cite our technical report:

```bibtex
@misc{luo2025pcmind21kaiyuan2btechnicalreport,
  title={PCMind-2.1-Kaiyuan-2B Technical Report},
  author={Kairong Luo and Zhenbo Sun and Xinyu Shi and Shengqi Chen and Bowen Yu and Yunyi Chen and Chenyi Dang and Hengtao Tao and Hui Wang and Fangming Liu and Kaifeng Lyu and Wenguang Chen},
  year={2025},
  eprint={2512.07612},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.07612},
}
```

## Resources

- **Model**: [thu-pacman/PCMind-2.1-Kaiyuan-2B](https://huggingface.co/thu-pacman/PCMind-2.1-Kaiyuan-2B)
- **Preprocessing Pipeline**: [Kaiyuan-Spark](https://github.com/thu-pacman/Kaiyuan-Spark)
- **Technical Report**: [arXiv:2512.07612](https://arxiv.org/abs/2512.07612)

## License

All artifacts (including code, model weights, and training data) of Kaiyuan-2B are licensed under [Apache-2.0 License](LICENSE) with the following copyright notice:

```text
Copyright 2025 Tsinghua University & Peng Cheng Laboratory

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

_NOTICE: This dataset constitutes a derivative work of multiple underlying raw datasets. Users must comply with the applicable license terms of each source dataset._ Please refer to Section B of [our technical report](https://arxiv.org/abs/2512.07612) for details.
Provider:
Lxd99