meryyllebr543/pretrain-mix-150b

Name: meryyllebr543/pretrain-mix-150b
Creator: meryyllebr543
Published: 2025-08-07 09:21:33
License: not specified

Hugging Face | Updated: 2025-08-07 | Indexed: 2025-08-30

Download link:

https://hf-mirror.com/datasets/meryyllebr543/pretrain-mix-150b

Description:

pretrain-mix-150b is a high-quality, 150-billion-token pre-training dataset curated for large language model research and development. It is a strategic mix of educational web text, comprehensive mathematical documents, and a diverse collection of source code, designed to foster strong reasoning and multi-domain capabilities in pre-trained models.

Provider:

meryyllebr543
