frikishaan/PrimeCorpus-1B
Source: Hugging Face | Updated: 2026-01-13 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/frikishaan/PrimeCorpus-1B
Description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 4564494450
    num_examples: 1058066
  download_size: 2685359787
  dataset_size: 4564494450
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- text-generation
language:
- en
pretty_name: PrimeCorpus-1B
---
# PrimeCorpus-1B
PrimeCorpus-1B is a curated 1-billion-token text dataset for training small and mid-scale language models. It focuses on educational, encyclopedic, and narrative domains to provide a balanced learning signal, and is intended specifically for learning and experimentation.
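As a minimal sketch of how the corpus might be inspected, the snippet below streams the `train` split with the Hugging Face `datasets` library. The repository ID, split name, and the `text` and `source` columns come from this card's metadata; the helper name `preview` and its parameters are illustrative assumptions.

```python
from itertools import islice

REPO_ID = "frikishaan/PrimeCorpus-1B"  # repository name taken from this card


def preview(n=3, streaming=True):
    """Return the first n rows of the train split as dicts (hypothetical helper)."""
    # Imported lazily so this module loads even where `datasets` is not installed.
    from datasets import load_dataset

    # streaming=True avoids downloading the full dataset before reading rows.
    ds = load_dataset(REPO_ID, split="train", streaming=streaming)
    return list(islice(ds, n))
```

Calling `preview()` should yield rows with a `text` field and a `source` field, which makes it easy to check which upstream dataset a sample came from.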
## Composition
| Source | Tokens |
|---|---|
| [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | 500M |
| [finewiki](https://huggingface.co/datasets/HuggingFaceFW/finewiki) | 300M |
| [Gutenberg books](https://huggingface.co/datasets/incredible45/Gutenberg-BookCorpus-Cleaned-Data-English) | 150M |
| [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) | 50M |
| **Total** | **1B** |
_**Note** - Token counts are measured using a GPT-2 tokenizer._
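The table can be turned into mixture proportions with a few lines of arithmetic. The sketch below uses the nominal token counts from the table and the `num_examples` value from the card metadata; the resulting tokens-per-example figure is a rough estimate, since the listed counts are rounded.

```python
# Nominal token counts from the Composition table (GPT-2 tokenizer).
TOKENS = {
    "fineweb-edu": 500_000_000,
    "finewiki": 300_000_000,
    "Gutenberg books": 150_000_000,
    "TinyStories": 50_000_000,
}
NUM_EXAMPLES = 1_058_066  # num_examples of the train split, from the card metadata


def mixture_shares(tokens):
    """Return each source's fraction of the total token count."""
    total = sum(tokens.values())
    return {name: count / total for name, count in tokens.items()}


shares = mixture_shares(TOKENS)
total_tokens = sum(TOKENS.values())
avg_tokens_per_example = total_tokens / NUM_EXAMPLES  # roughly 945

for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

Under these numbers, half of the tokens come from fineweb-edu, and the narrative sources (Gutenberg books and TinyStories) together make up 20%.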
## Processing
- Markdown syntax removed (headers, bold, italic, etc.).
- Gutenberg texts aggressively cleaned due to heavy noise and structural artifacts.
- Non-English data removed.
- Only fineweb-edu samples with a score **greater than 4** included.
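The card does not publish its processing code, so the following is only a sketch of what two of the steps above could look like: a regex-based Markdown stripper and a score filter. The function names are hypothetical, the regexes cover only headers, bold, and asterisk italics, and the `score` field name is an assumption about the fineweb-edu row schema. Language filtering and Gutenberg cleanup are not covered here.

```python
import re


def strip_markdown(text: str) -> str:
    """Remove a few common Markdown markers (illustrative, not exhaustive)."""
    # Drop ATX header markers at the start of a line: "## Title" -> "Title".
    text = re.sub(r"^[ ]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Unwrap bold first (**x** or __x__), then asterisk italics (*x*).
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
    text = re.sub(r"__(.+?)__", r"\1", text)
    text = re.sub(r"\*(.+?)\*", r"\1", text)
    return text


def keep_fineweb_edu(row: dict, threshold: float = 4.0) -> bool:
    """Keep a row only if its educational score is strictly above the threshold.

    Assumes the row carries a numeric `score` field.
    """
    return row["score"] > threshold
```

For example, `strip_markdown("## Title\nSome **bold** and *italic* text.")` returns the same text without the `##`, `**`, and `*` markers, and a row with a score of exactly 4.0 is dropped because the comparison is strict.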
## Intended Use
- Training and evaluation of small language models.
- Designed for pre-training GPT-2–style models focused on prose generation, **not** conversational training.
- Experiments in architecture variations, scaling laws, and curriculum strategies.
- Educational and research-oriented projects.
## Limitations
- Not designed for production-grade or safety-critical applications.
- Domain coverage is intentionally narrow.
- Source biases remain.
## License
Each component retains its original license. Users must ensure compliance with the respective source licenses when redistributing the data or training models on it.
Provider:
frikishaan



