david-thrower/HelixLM-tiny-5.0Mt-9125pt-715it-20260427
Source: Hugging Face. Updated 2026-04-27. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/david-thrower/HelixLM-tiny-5.0Mt-9125pt-715it-20260427
Description:
This dataset is a curated, scaled-down subset drawn from three high-quality sources and mixed in the recommended ratios. It includes: 1) the sample-10BT subset of FineWeb-Edu, about 85%, which supplies high-quality educational web text for pretraining; 2) the train split of OpenWebMath, about 10%, which supplies mathematical reasoning and STEM content; 3) the train split of OpenHermes-2.5, about 5%, which supplies instruction-following and conversational data. The dataset is divided into a pretraining part and an instruction-tuning part: the pretraining part (pretrain_train/pretrain_val) contains the FineWeb-Edu and OpenWebMath data; the instruction-tuning part (instruct_train/instruct_val) contains the OpenHermes-2.5 data, formatted with special tags (e.g., <|system|>, <|user|>, <|assistant|>, <|endoftext|>). Data preparation uses streaming loads with a shuffle buffer of 10,000 examples to avoid downloading the full corpora, holds out 2% of each split for validation, and applies robust schema detection that handles several instruction-format variants.
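The preparation steps described above can be illustrated with a minimal, standard-library-only sketch. It is not the author's actual pipeline: every function name here (`token_budgets`, `buffered_shuffle`, `holdout_split`, `to_turns`, `format_chat`) is hypothetical, the exact tag layout (one tag per line, followed by the text) is an assumption, and the three record layouts handled by `to_turns` are only a guess at what "schema detection" might need to cover.

```python
import random
from typing import Dict, Iterable, Iterator, List, Tuple

# Approximate mixing ratios stated in the description.
MIX = {"fineweb-edu": 0.85, "openwebmath": 0.10, "openhermes-2.5": 0.05}


def token_budgets(total_tokens: int, mix: Dict[str, float] = MIX) -> Dict[str, int]:
    """Split a total token budget across the sources by ratio."""
    return {name: int(round(total_tokens * share)) for name, share in mix.items()}


def buffered_shuffle(items: Iterable, buffer_size: int = 10_000, seed: int = 0) -> Iterator:
    """Approximately shuffle a stream while holding only buffer_size items in memory."""
    rng = random.Random(seed)
    buf: list = []
    for item in items:
        if len(buf) < buffer_size:
            buf.append(item)
            continue
        # Buffer is full: emit a random buffered item and put the new one in its slot.
        idx = rng.randrange(buffer_size)
        yield buf[idx]
        buf[idx] = item
    rng.shuffle(buf)
    yield from buf


def holdout_split(items: Iterable, val_fraction: float = 0.02) -> Tuple[list, list]:
    """Return (train, val), holding out val_fraction of the items for validation."""
    items = list(items)
    n_val = max(1, int(round(len(items) * val_fraction)))
    return items[n_val:], items[:n_val]


# Role names seen in common instruction datasets, mapped to the tag names above.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system",
            "user": "user", "assistant": "assistant"}


def to_turns(record: dict) -> List[Tuple[str, str]]:
    """Normalize a few assumed record layouts into (role, text) pairs."""
    if "conversations" in record:   # ShareGPT-style: [{"from": ..., "value": ...}]
        return [(ROLE_MAP[t["from"]], t["value"]) for t in record["conversations"]]
    if "messages" in record:        # chat-style: [{"role": ..., "content": ...}]
        return [(ROLE_MAP[t["role"]], t["content"]) for t in record["messages"]]
    if "instruction" in record:     # flat instruction/output pairs
        return [("user", record["instruction"]), ("assistant", record["output"])]
    raise ValueError(f"unrecognized schema with keys: {sorted(record)}")


def format_chat(turns: List[Tuple[str, str]]) -> str:
    """Render turns with <|role|> tags and a closing <|endoftext|> (layout assumed)."""
    parts = [f"<|{role}|>\n{text.strip()}" for role, text in turns]
    return "\n".join(parts) + "\n<|endoftext|>"
```

For example, `format_chat(to_turns({"instruction": "Add 2 and 3.", "output": "5"}))` yields a user turn, an assistant turn, and the end-of-text tag. A real pipeline would likely stream the sources with a library loader rather than in-memory iterables; the buffer-based shuffle is shown only to make the "shuffle buffer (10,000)" idea concrete.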
Provider:
david-thrower



