meryyllebr543/pretrain-mix-150b
Source: Hugging Face. Updated 2025-08-07. Indexed 2025-08-30.
Download link:
https://hf-mirror.com/datasets/meryyllebr543/pretrain-mix-150b
Description:
pretrain-mix-150b is a high-quality, 150-billion-token pre-training dataset meticulously curated for large language model research and development. This dataset is a strategic mix of high-quality educational web text, comprehensive mathematical documents, and a diverse collection of source code, designed to foster strong reasoning and multi-domain capabilities in pre-trained models.
Provider:
meryyllebr543



