bluelightai-dev/clt_pretrain_data

Name: bluelightai-dev/clt_pretrain_data
Creator: bluelightai-dev
Published: 2025-07-11 04:26:44
License: Not specified

Source: Hugging Face. Updated: 2025-07-11. Indexed: 2025-10-25.

Download link:

https://hf-mirror.com/datasets/bluelightai-dev/clt_pretrain_data

Description:

The Qwen3-Inspired Pre-training Dataset is a high-quality text data mixture for large language model pre-training, inspired by the Qwen3 methodology. It contains training and validation splits totaling 1.042 billion tokens: 94.9% in the training split and 5.1% in the validation split. Sources include DCLM Baseline, The Stack, Common Corpus, Mini Pile, and Math Pile, among others. The data was preprocessed with standardization, deduplication, and quality filtering, among other steps.
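As a rough illustration of the split sizes quoted above, the sketch below converts the stated percentages into approximate token counts. The total and the percentages come from the description and are rounded, so the results are estimates only. The `stream_split` helper is hypothetical: it assumes the Hugging Face `datasets` library is installed and that the splits are named `train` and `validation`, neither of which this listing confirms.

```python
# Figures taken from the dataset description. They are rounded, so the
# computed values are approximations, not exact token counts.
TOTAL_TOKENS = 1_042_000_000                         # "1.042 billion tokens"
SPLIT_PER_MILLE = {"train": 949, "validation": 51}   # 94.9% and 5.1%


def approx_split_tokens(total=TOTAL_TOKENS, shares=SPLIT_PER_MILLE):
    """Return approximate token counts per split using integer arithmetic."""
    return {name: total * share // 1000 for name, share in shares.items()}


def stream_split(split="train"):
    """Hypothetical loader: stream one split without downloading it all.

    Assumption: the split names are "train" and "validation". Check the
    dataset card before relying on them.
    """
    from datasets import load_dataset  # imported lazily; optional dependency

    return load_dataset(
        "bluelightai-dev/clt_pretrain_data", split=split, streaming=True
    )


SPLIT_TOKENS = approx_split_tokens()
```

Under these figures the training split holds roughly 989 million tokens and the validation split roughly 53 million. Calling `stream_split()` requires network access to Hugging Face or a configured mirror.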

Provider:

bluelightai-dev
