thu-pacman/PCMind-2.1-Kaiyuan-2B
Source: Hugging Face. Updated 2025-12-12; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/thu-pacman/PCMind-2.1-Kaiyuan-2B
Description:
This dataset is the complete pretraining dataset for the PCMind-v2.1-Kaiyuan-2B language model. It is organized into 5 training phases, each with a domain-specific mixing strategy across five primary domains: English, Chinese, Code, Math, and SFT (Supervised Fine-Tuning). Phases 1-2 use uniform sampling, while phases 3-5 use curriculum learning. The dataset exceeds 1 TB in size, covers Chinese and English text, and is licensed under Apache-2.0.
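The difference between the two sampling regimes named in the description can be illustrated with a minimal sketch. It is not the dataset's actual pipeline: the listing does not say which criterion drives the curriculum, so the `score` field, the toy samples, and the function names `uniform_order` and `curriculum_order` are all hypothetical. Only the five domain names come from the description.

```python
import random


def uniform_order(samples, seed=0):
    """Uniform sampling: every sample is equally likely at each step,
    so the presentation order is a plain seeded shuffle."""
    rng = random.Random(seed)
    shuffled = list(samples)  # copy so the input list is left untouched
    rng.shuffle(shuffled)
    return shuffled


def curriculum_order(samples, score_key="score"):
    """Curriculum ordering: present samples in ascending order of a score.
    What the score measures is an assumption of this sketch."""
    return sorted(samples, key=lambda s: s[score_key])


# Hypothetical toy samples: domain names are from the description,
# the scores are invented purely for illustration.
samples = [
    {"domain": "English", "score": 0.9},
    {"domain": "Chinese", "score": 0.4},
    {"domain": "Code", "score": 0.7},
    {"domain": "Math", "score": 0.2},
    {"domain": "SFT", "score": 0.5},
]

early_phase = uniform_order(samples, seed=0)  # style of phases 1-2
late_phase = curriculum_order(samples)        # style of phases 3-5

print([s["domain"] for s in early_phase])
print([s["domain"] for s in late_phase])
```

In the uniform case the order carries no information, while in the curriculum case the order itself is part of the training recipe, which is presumably why the data is published per phase.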
Provider:
thu-pacman



