MonumentalSystems/text-pipeline-corpus
Source: Hugging Face. Updated 2026-03-14; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/MonumentalSystems/text-pipeline-corpus
Resource description:
---
language:
- en
license: cc-by-4.0
task_categories:
- text-generation
tags:
- curated
- scientific-papers
- classical-texts
- quality-filtered
- deduplicated
size_categories:
- 100M<n<1B
---
# Text Pipeline Corpus
A curated, quality-filtered training corpus for language model pre-training, built by the Monumental Systems team.
## Dataset Description
This corpus combines scientific papers, classical literature, and educational texts. Every document passes through a quality pipeline with the following stages:
- **MTLD lexical diversity filtering** (threshold: 0.72)
- **English language detection** (min score: 0.20)
- **MinHash deduplication** (similarity threshold: 0.8, 128 permutations)
- **N-gram repetition filtering** (max 50% repeated trigrams)
- **Unicode normalization** and metadata stripping
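Two of the stages above can be illustrated with a minimal, dependency-free sketch. It is not the pipeline's actual code: the word tokenization, the exact definition of "repeated trigrams", the use of word trigrams as MinHash shingles, and the salted BLAKE2b hashes are all assumptions made for illustration. Only the three numeric settings come from the list above.

```python
import hashlib
import re
from collections import Counter

# Settings quoted in the list above.
MAX_REPEATED_TRIGRAMS = 0.5
MINHASH_THRESHOLD = 0.8
NUM_PERM = 128


def word_trigrams(text):
    """Lowercased word trigrams (this tokenization is an assumption)."""
    words = re.findall(r"\w+", text.lower())
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def repeated_trigram_fraction(text):
    """Share of trigram occurrences whose trigram appears more than once."""
    grams = word_trigrams(text)
    if not grams:
        return 0.0
    counts = Counter(grams)
    return sum(c for c in counts.values() if c > 1) / len(grams)


def passes_repetition_filter(text):
    """Keep a document unless more than half of its trigrams are repeats."""
    return repeated_trigram_fraction(text) <= MAX_REPEATED_TRIGRAMS


def minhash_signature(text, num_perm=NUM_PERM):
    """One minimum hash value per seeded hash function over the trigram set."""
    shingles = {" ".join(g).encode("utf-8") for g in word_trigrams(text)}
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per "permutation"
        signature.append(min(
            (int.from_bytes(
                hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "big")
             for s in shingles),
            default=2**64 - 1,  # sentinel for texts with fewer than three words
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(text_a, text_b):
    """Compare two documents against the 0.8 similarity threshold."""
    sim = estimated_jaccard(minhash_signature(text_a), minhash_signature(text_b))
    return sim >= MINHASH_THRESHOLD
```

Comparing every pair of signatures does not scale to a full corpus. Production deduplication usually groups signatures into bands with locality-sensitive hashing first, a step this sketch leaves out.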
## Domain Splits
| Split | Size | Description | Domain Weight |
|-------|------|-------------|---------------|
| `combined_train_mixedcase.txt` | ~543 MB | Full training corpus (mixed case) | 100% |
| `combined_val.txt` | ~55 MB | Validation split | - |
| `train_quadrivium.txt` | ~2.5 GB | Science, math, technical papers | 35% |
| `train_trivium.txt` | ~138 MB | Grammar, rhetoric, logic, literature | 22% |
| `train_philosophy.txt` | ~24 MB | Classical philosophy texts | subset |
## Sources
- **ArXiv**: 4,220 papers across 162 categories
- **PubMed Central**: 1,325 full-text papers
- **PLOS Journals**: 1,401 open-access papers
- **bioRxiv**: 684 biology preprints
- **Project Gutenberg & MIT Classics**: Classical literature and philosophy
- **WikiText-103**: Expository encyclopedia text
## Domain Weighting (DoReMi-style)
- Science papers: 35%
- Classics & literature: 22%
- Textbooks: 18%
- General knowledge: 15%
- Wikipedia: 10%
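
The card does not say how these weights are applied during corpus assembly. One plausible reading, sketched below, is that each training document is drawn from a domain with probability equal to its weight. The domain keys and the `sample_domains` helper are hypothetical names, and the sketch only samples with fixed weights; it does not reproduce DoReMi's procedure for learning weights with a proxy model.

```python
import random

# Weights copied from the list above; the key names are illustrative.
DOMAIN_WEIGHTS = {
    "science_papers": 0.35,
    "classics_literature": 0.22,
    "textbooks": 0.18,
    "general_knowledge": 0.15,
    "wikipedia": 0.10,
}


def sample_domains(n, weights=DOMAIN_WEIGHTS, seed=0):
    """Draw n domain names, each with probability proportional to its weight."""
    rng = random.Random(seed)  # seeded so that the draw is reproducible
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)
```

Over a large number of draws, the share of each domain should approach its listed weight, which gives a quick sanity check on a mixing script.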
## Usage
```python
from datasets import load_dataset
ds = load_dataset("MonumentalSystems/text-pipeline-corpus")
```
Or download individual splits:
```python
from huggingface_hub import hf_hub_download
# hf_hub_download returns the local cache path of the downloaded file.
path = hf_hub_download(
    repo_id="MonumentalSystems/text-pipeline-corpus",
    filename="data/combined_train_mixedcase.txt",
    repo_type="dataset",
)
```
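Because the splits are large plain-text files, it can be more practical to stream a downloaded split than to read it into memory at once. The sketch below assumes UTF-8 encoding and treats each non-empty line as one unit of text; the card specifies neither, and `iter_nonempty_lines` is a hypothetical helper.

```python
from pathlib import Path
from typing import Iterator


def iter_nonempty_lines(path: str) -> Iterator[str]:
    """Yield stripped, non-empty lines one at a time without loading the file."""
    with Path(path).open("r", encoding="utf-8") as fh:  # encoding is an assumption
        for line in fh:
            line = line.strip()
            if line:
                yield line
```

The `path` returned by `hf_hub_download` above can be passed straight to this function.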
## Pipeline
Built with [buildwithbooks/text-pipeline](https://github.com/buildwithbooks/text-pipeline).
Provider:
MonumentalSystems



