david-thrower/HelixLM-small-50.0Mt-91250pt-7143it-20260427
Source: Hugging Face. Updated 2026-04-27. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/david-thrower/HelixLM-small-50.0Mt-91250pt-7143it-20260427
Overview:
This dataset is a mixture built for training the HelixLM model. It combines three sources in the following approximate proportions: FineWeb-Edu (about 85%) supplies educational web text for pretraining; OpenWebMath (about 10%) supplies mathematical reasoning and STEM content for pretraining; and OpenHermes-2.5 (about 5%) supplies instruction-following and conversational data for instruction tuning. The dataset has two main parts, pretraining and instruction tuning, each with train and validation splits. The data is loaded by streaming and shuffled, with rolling chunking for long documents and natural stop detection. It is intended for training and fine-tuning small language models, in support of HelixLM's hyperpersonalization and on-device AI use cases.
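The listing does not say how the mixing or chunking is implemented, so the following is only a rough sketch of the two ideas in the description: drawing documents from sources in the stated approximate proportions, and cutting a long document into overlapping (rolling) windows. The names `MIX`, `pick_source`, and `rolling_chunks`, along with the `chunk_len` and `stride` parameters, are hypothetical and chosen for illustration; they are not taken from the dataset or from HelixLM.

```python
import random
from typing import Dict, Iterator, List, Sequence

# Approximate proportions quoted in the description (illustrative constant name).
MIX: Dict[str, float] = {
    "FineWeb-Edu": 0.85,
    "OpenWebMath": 0.10,
    "OpenHermes-2.5": 0.05,
}


def pick_source(rng: random.Random, weights: Dict[str, float]) -> str:
    """Pick one source name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def rolling_chunks(
    tokens: Sequence[int], chunk_len: int, stride: int
) -> Iterator[List[int]]:
    """Yield windows of at most chunk_len tokens, advancing by stride each time.

    With stride < chunk_len, consecutive windows overlap. This is one common
    reading of "rolling chunking"; the actual pipeline may differ.
    """
    if chunk_len <= 0 or stride <= 0:
        raise ValueError("chunk_len and stride must be positive")
    start = 0
    while start < len(tokens):
        yield list(tokens[start:start + chunk_len])
        if start + chunk_len >= len(tokens):
            break  # this window already reached the end of the document
        start += stride


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [pick_source(rng, MIX) for _ in range(10_000)]
    for name in MIX:
        print(name, draws.count(name) / len(draws))

    # A 10-token "document" cut into windows of 4 tokens, advancing by 3.
    for chunk in rolling_chunks(list(range(10)), chunk_len=4, stride=3):
        print(chunk)
```

Sampling per document keeps the mixture close to the target proportions only on average; a pipeline that needs exact token-level ratios would have to track token counts per source instead.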
Provider:
david-thrower



