trixyL/simplestories-4k-megatron
Source: Hugging Face. Updated 2026-02-14; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/trixyL/simplestories-4k-megatron
Resource description:
---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- megatron
- 4k
- megadlms
pretty_name: Megatron 4k SimpleStories
size_categories:
- 1M<n<10M
---
<h1 align="center">📦 Megatron-LM/MegaDLMs Preprocessed Dataset</h1>
This **dataset** repository hosts the SimpleStories dataset, preprocessed for Megatron-LM/MegaDLMs with a 4k-vocabulary BPE tokenizer.
## ✅ What this contains
- Preprocessed Megatron dataset files (e.g., `.bin` / `.idx`) ready for Megatron-LM/MegaDLMs training
- BPE Tokenizer config files used to create the dataset
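
Megatron-LM addresses an indexed dataset by its path prefix, that is, the shared file name without the `.bin` / `.idx` extension. As a minimal sketch, assuming you have already downloaded this repository to a local directory, the helper below lists every prefix that has both files present. The function name `find_megatron_prefixes` is hypothetical and is not part of Megatron-LM or MegaDLMs, and the card does not state the actual file names in this repository.

```python
from pathlib import Path


def find_megatron_prefixes(root):
    """Return path prefixes under `root` that have both a .bin and a .idx file.

    Hypothetical helper for illustration; not part of Megatron-LM or MegaDLMs.
    """
    prefixes = []
    for bin_file in sorted(Path(root).rglob("*.bin")):
        # A usable Megatron dataset needs the matching index file next to the data file.
        if bin_file.with_suffix(".idx").exists():
            # Drop the ".bin" extension to get the prefix Megatron expects.
            prefixes.append(str(bin_file.with_suffix("")))
    return prefixes
```

Each returned prefix is the kind of value you would pass as the dataset path in a Megatron-LM training command. One way to obtain the local directory is `huggingface_hub.snapshot_download(repo_id="trixyL/simplestories-4k-megatron", repo_type="dataset")`, which returns the path of the downloaded snapshot.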
## 🔗 References
Fork with extra preprocessing utils for SimpleStories:
https://github.com/triloy8/MegaDLMs
Original MegaDLMs repo:
https://github.com/JinjieNi/MegaDLMs
SimpleStories dataset:
https://huggingface.co/datasets/SimpleStories/SimpleStories
Provider:
trixyL



