bluelightai-dev/clt_pretrain_data

Source: Hugging Face. Updated 2025-07-11. Indexed 2025-10-25.
Download link:
https://hf-mirror.com/datasets/bluelightai-dev/clt_pretrain_data
Description:
The Qwen3-Inspired Pre-training Dataset is a high-quality text mixture for pre-training large language models, inspired by the Qwen3 methodology. It contains training and validation splits totaling 1.042 billion tokens: the training split accounts for 94.9% and the validation split for 5.1%. Sources include DCLM Baseline, The Stack, Common Corpus, Mini Pile, and Math Pile, among others. The data went through a preprocessing pipeline that includes standardization, deduplication, and quality filtering to ensure data quality.
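The split percentages above can be turned into approximate token counts. The sketch below is only arithmetic on the rounded figures quoted in the description (1.042 billion tokens, 94.9% / 5.1%); the names `TOTAL_TOKENS`, `SPLIT_PER_MILLE`, and `approx_split_tokens` are hypothetical and are not part of the dataset or any library, and the exact per-split counts in the published dataset may differ.

```python
# Rounded figures taken from the description; the true counts may differ slightly.
TOTAL_TOKENS = 1_042_000_000  # "1.042 billion tokens"

# Shares expressed in per-mille (parts per thousand) so the arithmetic stays in integers.
SPLIT_PER_MILLE = {"train": 949, "validation": 51}  # 94.9% and 5.1%


def approx_split_tokens(total, per_mille):
    """Return an approximate token count for each split (hypothetical helper)."""
    return {name: total * share // 1000 for name, share in per_mille.items()}


split_tokens = approx_split_tokens(TOTAL_TOKENS, SPLIT_PER_MILLE)

for name, count in split_tokens.items():
    print(f"{name}: ~{count / 1e6:.1f}M tokens")
```

Under these rounded inputs the training split comes to roughly 989 million tokens and the validation split to roughly 53 million.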
Provider:
bluelightai-dev