david-thrower/HelixLM-tiny-5.0Mt-9125pt-715it-20260427

Name: david-thrower/HelixLM-tiny-5.0Mt-9125pt-715it-20260427
Creator: david-thrower
Published: 2026-04-27 01:21:39
License: No description available

Source: Hugging Face. Updated: 2026-04-27. Indexed: 2026-05-03.

Download link:

https://hf-mirror.com/datasets/david-thrower/HelixLM-tiny-5.0Mt-9125pt-715it-20260427


Resource overview:

This dataset is a scaled-down, curated subset drawn from three high-quality sources and mixed in the recommended ratios:

1) FineWeb-Edu, sample-10BT split, ~85%: high-quality educational web text for pretraining.
2) OpenWebMath, train split, ~10%: mathematical reasoning and STEM content.
3) OpenHermes-2.5, train split, ~5%: instruction-following and conversational data.

The dataset is divided into a pretraining part and an instruction-tuning part. The pretraining part (pretrain_train/pretrain_val) contains the FineWeb-Edu and OpenWebMath data. The instruction-tuning part (instruct_train/instruct_val) contains the OpenHermes-2.5 data, formatted with special tags (e.g., <|system|>, <|user|>, <|assistant|>, <|endoftext|>).

Data preparation streams each source through a shuffle buffer of 10,000 examples, so the full corpora never have to be downloaded. 2% of each split is held out for validation, and robust schema detection handles the different instruction formatting variants.
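The preparation steps described above can be illustrated with a small, library-free sketch. It is not the author's preparation script: the function names (`source_quotas`, `shuffle_buffer`, `format_instruct`, `holdout_split`), the source keys, and the exact placement of newlines around the tags are assumptions made for illustration. Only the ~85/10/5 mix, the buffer size of 10,000, the tag names, and the 2% holdout come from the description.

```python
import random
from typing import Dict, Iterable, Iterator, List, Optional, Tuple, TypeVar

T = TypeVar("T")

# Approximate mix from the description, in whole percent (keys are hypothetical).
MIX_PERCENT: Dict[str, int] = {"fineweb_edu": 85, "openwebmath": 10, "openhermes": 5}


def source_quotas(total_docs: int) -> Dict[str, int]:
    """Number of documents to draw from each source, using integer math."""
    return {name: total_docs * pct // 100 for name, pct in MIX_PERCENT.items()}


def shuffle_buffer(stream: Iterable[T], buffer_size: int = 10_000, seed: int = 0) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most buffer_size items."""
    rng = random.Random(seed)
    buf: List[T] = []
    for item in stream:
        if len(buf) < buffer_size:
            buf.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buf[idx]        # emit a random buffered item...
        buf[idx] = item       # ...and put the new item in its slot
    rng.shuffle(buf)          # drain what is left in random order
    yield from buf


def format_instruct(user: str, assistant: str, system: Optional[str] = None) -> str:
    """Render one exchange with the tags named in the description.

    The newline layout is an assumption; the source only lists the tags.
    """
    parts: List[str] = []
    if system:
        parts.append(f"<|system|>\n{system}")
    parts.append(f"<|user|>\n{user}")
    parts.append(f"<|assistant|>\n{assistant}<|endoftext|>")
    return "\n".join(parts)


def holdout_split(items: List[T], val_fraction: float = 0.02) -> Tuple[List[T], List[T]]:
    """Return (train, val), holding out val_fraction of the items for validation."""
    n_val = max(1, round(len(items) * val_fraction)) if items else 0
    return items[n_val:], items[:n_val]


if __name__ == "__main__":
    print(source_quotas(10_000))
    docs = list(shuffle_buffer((f"doc {i}" for i in range(1_000)), buffer_size=100))
    train, val = holdout_split(docs)
    print(len(train), len(val))
    print(format_instruct("What is 2 + 2?", "4", system="You are a helpful tutor."))
```

When working with the Hugging Face `datasets` library, a comparable effect is available by loading with `streaming=True` and calling `.shuffle(buffer_size=10_000, seed=...)` on the resulting iterable dataset; whether the author used exactly that call is not stated in the description.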

Provider:

david-thrower
