
frikishaan/PrimeCorpus-1B

Hugging Face | Updated 2026-01-13 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/frikishaan/PrimeCorpus-1B
Resource description:
---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 4564494450
    num_examples: 1058066
  download_size: 2685359787
  dataset_size: 4564494450
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- text-generation
language:
- en
pretty_name: PrimeCorpus-1B
---

# PrimeCorpus-1B

PrimeCorpus-1B is a curated 1-billion-token text dataset for training small and mid-scale language models. It draws on educational, encyclopedic, and narrative domains to provide a balanced learning signal, and it is intended specifically for learning and experimentation.

## Composition

| Source | Tokens |
|---|---|
| [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | 500M |
| [finewiki](https://huggingface.co/datasets/HuggingFaceFW/finewiki) | 300M |
| [Gutenberg books](https://huggingface.co/datasets/incredible45/Gutenberg-BookCorpus-Cleaned-Data-English) | 150M |
| [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) | 50M |
| **Total** | **1 billion** |

_**Note:** Token counts are measured using a GPT-2 tokenizer._

## Processing

- Markdown syntax removed (headers, bold, italic, etc.).
- Gutenberg texts aggressively cleaned due to heavy noise and structural artifacts.
- Non-English data removed.
- Only fineweb-edu samples with a score **greater than 4** included.

## Intended Use

- Training and evaluation of small language models.
- Pre-training GPT-2-style models focused on prose generation, **not** conversational training.
- Experiments in architecture variations, scaling laws, and curriculum strategies.
- Educational and research-oriented projects.

## Limitations

- Not designed for production-grade or safety-critical applications.
- Domain coverage is intentionally narrow.
- Source biases remain.

## License

Each component retains its original license. Users must ensure compliance with the respective source licenses when redistributing the data or training models.
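
## Example: inspecting the data

The card metadata lists two string columns, `text` and `source`, in a single `train` split. The sketch below shows one way to stream that split with the Hugging Face `datasets` library and tally rows per `source` value. The helper names `stream_primecorpus` and `source_histogram` are hypothetical and not part of the dataset. The actual labels stored in `source` are not documented in the card, so the code makes no assumption about them.

```python
from collections import Counter
from itertools import islice


def source_histogram(rows, limit=1000):
    """Count rows per `source` value over the first `limit` rows.

    `rows` is any iterable of dicts that carry a "source" key, which is
    the column name given in the card metadata.
    """
    return Counter(row["source"] for row in islice(rows, limit))


def stream_primecorpus():
    """Return the train split as a streaming (lazily downloaded) dataset.

    Assumes the `datasets` package is installed and the Hub is reachable.
    The import is local so the helper above works without `datasets`.
    """
    from datasets import load_dataset

    return load_dataset("frikishaan/PrimeCorpus-1B", split="train", streaming=True)
```

Calling `source_histogram(stream_primecorpus(), limit=10_000)` would return a `Counter` keyed by whatever labels the `source` column holds. Row counts will not necessarily mirror the token shares in the Composition table, since document length differs by source (TinyStories samples are short, Gutenberg texts are long).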
Provided by:
frikishaan