
MonumentalSystems/text-pipeline-corpus

Hugging Face · Updated 2026-03-14 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/MonumentalSystems/text-pipeline-corpus
Resource description:
---
language:
- en
license: cc-by-4.0
task_categories:
- text-generation
tags:
- curated
- scientific-papers
- classical-texts
- quality-filtered
- deduplicated
size_categories:
- 100M<n<1B
---

# Text Pipeline Corpus

A curated, quality-filtered training corpus for language model pre-training, built by the Monumental Systems team.

## Dataset Description

This corpus combines scientific papers, classical literature, and educational texts, processed through a rigorous quality pipeline:

- **MTLD lexical diversity filtering** (threshold: 0.72)
- **English language detection** (min score: 0.20)
- **MinHash deduplication** (similarity threshold: 0.8, 128 permutations)
- **N-gram repetition filtering** (max 50% repeated trigrams)
- **Unicode normalization** and metadata stripping

## Domain Splits

| Split | Size | Description | Domain Weight |
|-------|------|-------------|---------------|
| `combined_train_mixedcase.txt` | ~543 MB | Full training corpus (mixed case) | 100% |
| `combined_val.txt` | ~55 MB | Validation split | - |
| `train_quadrivium.txt` | ~2.5 GB | Science, math, technical papers | 35% |
| `train_trivium.txt` | ~138 MB | Grammar, rhetoric, logic, literature | 22% |
| `train_philosophy.txt` | ~24 MB | Classical philosophy texts | subset |

## Sources

- **ArXiv**: 4,220 papers across 162 categories
- **PubMed Central**: 1,325 full-text papers
- **PLOS Journals**: 1,401 open-access papers
- **bioRxiv**: 684 biology preprints
- **Project Gutenberg & MIT Classics**: Classical literature and philosophy
- **WikiText-103**: Expository encyclopedia text

## Domain Weighting (DoReMi-style)

- Science papers: 35%
- Classics & literature: 22%
- Textbooks: 18%
- General knowledge: 15%
- Wikipedia: 10%

## Usage

```python
from datasets import load_dataset

ds = load_dataset("MonumentalSystems/text-pipeline-corpus")
```

Or download individual splits:

```python
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="MonumentalSystems/text-pipeline-corpus",
    filename="data/combined_train_mixedcase.txt",
    repo_type="dataset",
)
```

## Pipeline

Built with [buildwithbooks/text-pipeline](https://github.com/buildwithbooks/text-pipeline).
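
To make the trigram-repetition and MinHash steps listed under Dataset Description more concrete, here is a minimal standard-library sketch. It is not the code from the linked pipeline repository. The three thresholds come from the card; everything else is an assumption made for illustration, including the function names, the regex tokenizer, the 5-word shingle size, the salted BLAKE2b hashing, and the definition of "repeated trigrams" as the share of trigram occurrences whose trigram appears more than once.

```python
import hashlib
import re
from collections import Counter

# Thresholds quoted from the dataset card.
NUM_PERM = 128
SIM_THRESHOLD = 0.8
MAX_REPEATED_TRIGRAMS = 0.5


def tokens(text):
    # Assumed tokenizer: lowercase word characters only.
    return re.findall(r"\w+", text.lower())


def repeated_trigram_fraction(text):
    """Share of trigram occurrences whose trigram appears more than once."""
    toks = tokens(text)
    grams = [tuple(toks[i:i + 3]) for i in range(len(toks) - 2)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    return sum(c for c in counts.values() if c > 1) / len(grams)


def minhash_signature(text, num_perm=NUM_PERM, shingle_size=5):
    """MinHash signature over word shingles (shingle size is an assumption)."""
    toks = tokens(text)
    n_shingles = max(1, len(toks) - shingle_size + 1)
    shingles = {" ".join(toks[i:i + shingle_size]) for i in range(n_shingles)}
    signature = []
    for seed in range(num_perm):
        # A salted hash stands in for one random permutation.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching slots estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def filter_documents(docs):
    """Drop repetitive documents, then near-duplicates of documents already kept."""
    kept, kept_sigs = [], []
    for doc in docs:
        if repeated_trigram_fraction(doc) > MAX_REPEATED_TRIGRAMS:
            continue
        sig = minhash_signature(doc)
        if any(estimated_jaccard(sig, s) >= SIM_THRESHOLD for s in kept_sigs):
            continue
        kept.append(doc)
        kept_sigs.append(sig)
    return kept
```

The pairwise comparison in `filter_documents` is quadratic in the number of kept documents, so it only suits small samples. At corpus scale, deduplication of this kind is usually paired with locality-sensitive hashing so that only candidate pairs are compared.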
Provided by:
MonumentalSystems