LisaMegaWatts/bookcorpus-gutenberg-classics

Hugging Face | Updated 2026-02-26 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/LisaMegaWatts/bookcorpus-gutenberg-classics
Resource description:
---
license: apache-2.0
language:
- en
task_categories:
- text-generation
tags:
- bookcorpus
- gutenberg
- project-gutenberg
- philosophy
- classical-texts
- character-level
- curriculum-learning
- slm
size_categories:
- 10M<n<100M
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2128762265
    num_examples: 13556975
  - name: validation
    num_bytes: 236146830
    num_examples: 1506331
  download_size: 1610933332
  dataset_size: 2364909095
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
---

# BookCorpus + Gutenberg Classics Training Corpus

Large-scale training corpus combining BookCorpus fiction, Project Gutenberg 19th-century literature (PG-19), and curated classical philosophy texts. Cleaned, deduplicated, and organized into curriculum phases for character-level language model training.

## Dataset Description

This corpus is the primary training dataset for the Julia SLM project, combining three major text sources into a unified, cleaned training set with curriculum-phase annotations for structured learning.

### Source Composition

| Source | Files | Chunks (pre-dedup) | Proportion |
|--------|------:|-------------------:|-----------:|
| [BookCorpus](https://huggingface.co/datasets/bookcorpus/bookcorpus) | 147 | 14,190,796 | 89.4% |
| [PG-19](https://huggingface.co/datasets/deepmind/pg19) (Project Gutenberg) | 552 | 1,344,777 | 8.5% |
| Classical Philosophy (MIT Classics, Internet Archive, Gutenberg) | 137 | 330,954 | 2.1% |
| **Total** | **836** | **15,866,527** | **100%** |

### Cleaning Applied

All text has been processed through a multi-stage cleaning pipeline:

- **Character filtering**: Lowercased to ASCII set `a-z .,;:?!'"()-`
- **Source-specific cleaning**:
  - BookCorpus: Moses-style detokenization (recombined subword artifacts)
  - PG-19: Gutenberg boilerplate header/footer removal
  - Philosophy: LaTeX artifact removal, footnote/reference stripping
- **Deduplication**: Exact dedup removed 803,221 duplicates (5.1%)
- **Whitespace normalization**: Multi-space collapse, empty line removal

### Curriculum Phases

The corpus is organized into three curriculum phases based on the classical trivium/quadrivium education model, suitable for DoReMi-style weighted phase sampling:

| Phase | Description | Train Chunks | Proportion |
|-------|-------------|-------------:|-----------:|
| **Trivium** | Grammar, rhetoric, logic (BookCorpus fiction, classical literature, rhetoric) | 13,475,278 | 99.4% |
| **Quadrivium** | Arithmetic, geometry, music, astronomy (Aristotle Physics, Plato Timaeus, Euclid) | 11,652 | 0.08% |
| **Philosophy** | Pure philosophy (Kant, Spinoza, Bacon, Seneca, Schopenhauer) | 70,042 | 0.52% |

Phase-specific training files are available in the `curriculum/` directory.

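
As a rough illustration of weighted phase sampling over the three phases above, the sketch below draws each training example by first picking a phase according to a fixed weight and then picking a chunk uniformly within that phase. It uses the per-config weights this card lists under Dataset Statistics (trivium 40%, quadrivium 35%, philosophy 25%). This is plain fixed-weight mixture sampling, not DoReMi's weight optimization, and the function name `sample_batch` and the toy chunks are assumptions made for the example, not part of the dataset or the Julia SLM code.

```python
import random
from collections import Counter

# Phase weights as stated on this card; the dict itself is illustrative.
PHASE_WEIGHTS = {"trivium": 0.40, "quadrivium": 0.35, "philosophy": 0.25}


def sample_batch(phase_chunks, weights, batch_size, rng):
    """Pick a phase by weight, then a chunk uniformly within that phase.

    Hypothetical helper: returns a list of (phase, chunk) pairs.
    """
    phases = list(weights)
    probs = [weights[p] for p in phases]
    picked = rng.choices(phases, weights=probs, k=batch_size)
    return [(p, rng.choice(phase_chunks[p])) for p in picked]


if __name__ == "__main__":
    # Toy stand-ins for the contents of the curriculum files (made-up text).
    toy = {
        "trivium": ["she opened the door.", "he said nothing."],
        "quadrivium": ["a point is that which has no part."],
        "philosophy": ["the unexamined life is not worth living."],
    }
    rng = random.Random(0)
    batch = sample_batch(toy, PHASE_WEIGHTS, 10_000, rng)
    counts = Counter(phase for phase, _ in batch)
    for phase in PHASE_WEIGHTS:
        # Empirical share of each phase; should sit close to its weight.
        print(phase, round(counts[phase] / len(batch), 3))
```

Because the quadrivium phase holds only 11,652 train chunks yet receives 35% of the draws under these weights, its chunks would be revisited far more often than trivium chunks. Sampling with replacement, as here, is one simple way to handle that imbalance.
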
### Dataset Statistics

| Split | Examples | Size |
|-------|----------|------|
| Train | 13,556,975 | 2.0 GB |
| Validation | 1,506,331 | 221 MB |

- **90/10 train/validation split** (shuffled)
- **Weighted phase sampling** applied per config: trivium 40%, quadrivium 35%, philosophy 25%

### Philosophy Sources

The corpus includes texts from 50+ classical authors spanning Greek, Roman, Medieval, Enlightenment, and Modern philosophy:

**Greek**: Aristotle (Metaphysics, Nicomachean Ethics, Politics, Physics, Rhetoric, Poetics, Categories, Prior/Posterior Analytics, Topics, On the Soul, On the Heavens, On Interpretation, Generation and Corruption), Plato (Republic, Laws, Timaeus, Phaedo, Phaedrus, Symposium, Meno, Theaetetus, Protagoras), Herodotus, Thucydides, Xenophon, Aeschylus, Sophocles, Homer, Euripides

**Roman**: Marcus Aurelius, Seneca, Epictetus, Cicero, Lucretius, Plutarch, Tacitus, Virgil

**Medieval/Renaissance**: Boethius, Machiavelli, Thomas More

**Enlightenment**: Descartes, Spinoza, Leibniz, Locke, Berkeley, Hume, Kant, Rousseau, Montesquieu

**Modern**: Schopenhauer, Mill, Thoreau, William James

**Eastern**: Bhagavad Gita, Sun Tzu, Confucius, Lao Tzu

## Additional Files

The `curriculum/` directory contains phase-specific training files:

- `train_trivium.txt` - Grammar, rhetoric, and logic texts (2.0 GB)
- `train_quadrivium.txt` - Mathematical and natural philosophy texts (2.1 MB)
- `train_philosophy.txt` - Pure philosophy texts (13 MB)

## Usage

```python
from datasets import load_dataset
from huggingface_hub import hf_hub_download

# Load both splits (train and validation) from the Hub
ds = load_dataset("LisaMegaWatts/bookcorpus-gutenberg-classics")

# Training data: each example has a single "text" field
for example in ds["train"]:
    text = example["text"]

# Download phase-specific files for curriculum training;
# the call returns the local path of the cached file
trivium = hf_hub_download(
    "LisaMegaWatts/bookcorpus-gutenberg-classics",
    "curriculum/train_trivium.txt",
    repo_type="dataset",
)
```

## Related Datasets

- [LisaMegaWatts/philosophy-corpus](https://huggingface.co/datasets/LisaMegaWatts/philosophy-corpus) - Isolated philosophy provenance dataset (for model training lineage)
- [LisaMegaWatts/wikitext-103-quality-scored](https://huggingface.co/datasets/LisaMegaWatts/wikitext-103-quality-scored) - Quality-scored WikiText-103 (Wikipedia Featured articles)
- [LisaMegaWatts/classical-humanities-corpus](https://huggingface.co/datasets/LisaMegaWatts/classical-humanities-corpus) - Extended classical humanities collection

## License

Apache 2.0

Provider:
LisaMegaWatts