frikishaan/PrimeCorpus-1B

Name: frikishaan/PrimeCorpus-1B
Creator: frikishaan
Published: 2026-01-13 14:04:12
License: No description available

Hugging Face · Updated 2026-01-13 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/frikishaan/PrimeCorpus-1B

Description:

---
dataset_info:
  features:
  - name: text
    dtype: string
  - name: source
    dtype: string
  splits:
  - name: train
    num_bytes: 4564494450
    num_examples: 1058066
  download_size: 2685359787
  dataset_size: 4564494450
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- text-generation
language:
- en
pretty_name: PrimeCorpus-1B
---

# PrimeCorpus-1B

PrimeCorpus-1B is a curated 1-billion-token text dataset created for training small and mid-scale language models. It focuses on educational, encyclopedic, and narrative domains to provide a balanced learning signal, and is created specifically for learning and experimentation.

## Composition

| Source | Tokens |
|---|---|
| [fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | 500M |
| [finewiki](https://huggingface.co/datasets/HuggingFaceFW/finewiki) | 300M |
| [Gutenberg books](https://huggingface.co/datasets/incredible45/Gutenberg-BookCorpus-Cleaned-Data-English) | 150M |
| [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) | 50M |
| **Total** | **1 billion** |

_**Note** - Token counts are measured using a GPT-2 tokenizer._

## Processing

- Markdown syntax removed (headers, bold, italic, etc.).
- Gutenberg texts aggressively cleaned due to heavy noise and structural artifacts.
- Non-English data removed.
- Only fineweb-edu samples with a score **greater than 4** included.

## Intended Use

- Training and evaluation of small language models.
- Designed for pre-training GPT-2-style models focused on prose generation, **not** conversational training.
- Experiments in architecture variations, scaling laws, and curriculum strategies.
- Educational and research-oriented projects.

## Limitations

- Not designed for production-grade or safety-critical applications.
- Domain coverage is intentionally narrow.
- Source biases remain.

## License

Each component retains its original license. Users must ensure compliance with the respective source licenses when redistributing or training models.
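
The figures in the Composition table and the dataset metadata can be cross-checked with a little arithmetic. The sketch below is not part of the original card. It copies the card's rounded token counts and the `num_examples` and `dataset_size` values, then derives each source's share of the mixture and two rough averages. The variable names are illustrative, and the bytes-per-token figure is only an approximation, since `dataset_size` also covers the `source` column and storage overhead.

```python
# Figures copied from the dataset card. Token counts are the card's rounded
# GPT-2 tokenizer numbers, so every derived value here is approximate.
TOKENS_BY_SOURCE = {
    "fineweb-edu": 500_000_000,
    "finewiki": 300_000_000,
    "Gutenberg books": 150_000_000,
    "TinyStories": 50_000_000,
}
NUM_EXAMPLES = 1_058_066        # num_examples of the train split
DATASET_BYTES = 4_564_494_450   # dataset_size from the card metadata

total_tokens = sum(TOKENS_BY_SOURCE.values())

# Share of the token budget contributed by each source.
shares = {name: count / total_tokens for name, count in TOKENS_BY_SOURCE.items()}

# Rough averages; the byte count includes the `source` column and overhead
# (assumption: it is still dominated by the `text` column).
avg_tokens_per_example = total_tokens / NUM_EXAMPLES
avg_bytes_per_token = DATASET_BYTES / total_tokens

for name, share in shares.items():
    print(f"{name:<16} {share:6.1%}")
print(f"total tokens:           {total_tokens:,}")
print(f"avg tokens per example: {avg_tokens_per_example:,.0f}")
print(f"avg bytes per token:    {avg_bytes_per_token:.2f}")
```

Under these numbers, half of the token budget comes from fineweb-edu, and an average example is a little under a thousand GPT-2 tokens long, which may be useful when choosing a context length for small-model experiments.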

Provider:

frikishaan
