mhla/pre1900-corpus
Source: Hugging Face · Updated 2026-03-29 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/mhla/pre1900-corpus
Resource description:
---
license: mit
language:
- en
dataset_info:
  features:
  - name: text
    dtype: string
  - name: year
    dtype: int64
  - name: title
    dtype: string
  - name: source
    dtype: string
  - name: ocr_score
    dtype: float64
  - name: legibility
    dtype: float64
tags:
- pre-1900
- historical
- physics
- nlp
---
# Pre-1900 Corpus
The training corpus for [GPT-1900](https://huggingface.co/mhla/gpt1900-d34-22btok) — a cleaned collection of pre-1900 English-language texts with full metadata. Every document in this corpus was published before the year 1900.
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `text` | string | Full document text |
| `year` | int64 | Publication year |
| `title` | string | Book title or newspaper name |
| `source` | string | Source dataset identifier |
| `ocr_score` | float64 | OCR confidence score (-1.0 if unavailable) |
| `legibility` | float64 | Legibility score (-1.0 if unavailable) |
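
Because `ocr_score` and `legibility` use `-1.0` as a "not available" sentinel, a naive threshold such as `ocr_score >= 0.8` silently drops every unscored document, and averaging the column mixes sentinels into the statistic. The sketch below handles the sentinel explicitly. The threshold values and the assumption that scores lie on a 0 to 1 scale are illustrative only; the card does not state the scale.

```python
MISSING = -1.0  # sentinel the schema uses for "score unavailable"


def has_scores(row, min_ocr=0.8, min_legibility=0.8):
    """Return True if both scores are present and meet the thresholds.

    The default thresholds are hypothetical; pick values after inspecting
    the actual score distribution.
    """
    ocr, leg = row["ocr_score"], row["legibility"]
    if ocr == MISSING or leg == MISSING:
        return False  # treat unscored documents as "unknown", not as "bad"
    return ocr >= min_ocr and leg >= min_legibility


# Toy rows standing in for corpus records (values are made up).
rows = [
    {"title": "A", "ocr_score": 0.95, "legibility": 0.90},
    {"title": "B", "ocr_score": -1.0, "legibility": 0.90},
    {"title": "C", "ocr_score": 0.60, "legibility": 0.99},
]
kept = [r["title"] for r in rows if has_scores(r)]
print(kept)
```

The same predicate can be passed to `Dataset.filter` from the `datasets` library. Whether unscored documents should be dropped or kept is a design choice that depends on the downstream task.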
## Sources
- **Institutional books** — HathiTrust, Internet Archive, and other digitized book collections
- **British Library books** — TheBritishLibrary/blbooks
- **Historical newspapers** — dell-research-harvard/AmericanStories
## Filtering Pipeline
1. **OCR cleanup** — removal of OCR artifacts, boilerplate, and unicode normalization
2. **Quality filtering** — token frequency prior-based filtering as a cheap proxy for perplexity
3. **Anachronism detection** — three-tier post-1900 physics filter to remove mislabeled modern texts:
   - *Always reject*: unambiguous post-1900 terms (photon, spacetime, transistor, etc.)
   - *Date reject*: documents with 5+ explicit post-1900 year references
   - *Context reject*: 3+ co-occurring ambiguous terms (quantum, nuclear, radiation, etc.)
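
The three tiers of the anachronism filter can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the term lists contain only the examples named above, and several details are assumptions (exact lowercase word matching, counting every year mention rather than distinct years, counting distinct ambiguous terms, and treating 1900 itself as a post-1900 year).

```python
import re

# Illustrative term lists built from the examples above; the real lists
# are presumably longer.
ALWAYS_REJECT = {"photon", "spacetime", "transistor"}
AMBIGUOUS = {"quantum", "nuclear", "radiation"}

# Explicit four-digit years from 1900 to 2099 (assumed range).
YEAR_RE = re.compile(r"\b(19\d{2}|20\d{2})\b")


def anachronism_verdict(text, min_years=5, min_ambiguous=3):
    """Return 'always_reject', 'date_reject', 'context_reject', or 'keep'."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    # Tier 1: any unambiguous post-1900 term rejects the document.
    if words & ALWAYS_REJECT:
        return "always_reject"
    # Tier 2: five or more explicit post-1900 year mentions.
    if len(YEAR_RE.findall(text)) >= min_years:
        return "date_reject"
    # Tier 3: three or more distinct ambiguous terms in the same document.
    if len(words & AMBIGUOUS) >= min_ambiguous:
        return "context_reject"
    return "keep"


print(anachronism_verdict("Faraday studied the radiation of heat in 1845."))
```

A single ambiguous term such as "radiation" does not trigger rejection on its own, which is the point of the third tier: the word was in use before 1900, and only its co-occurrence with other modern-sounding terms suggests a mislabeled text.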
## Usage
```python
from datasets import load_dataset
ds = load_dataset("mhla/pre1900-corpus")
```
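
For a quick look at the corpus without downloading it in full, streaming mode is one option. The sketch below pairs a pure helper that counts documents per decade from the `year` column with a lazy loader. The split name `"train"` is an assumption, since the card does not list the splits.

```python
from collections import Counter


def decade_counts(rows):
    """Count documents per decade from any iterable of rows with a `year` field."""
    return Counter((row["year"] // 10) * 10 for row in rows)


def stream_corpus(split="train"):
    """Return a streaming iterator over the corpus.

    The split name "train" is an assumption; check the dataset page.
    """
    # Imported lazily so decade_counts has no third-party dependency.
    from datasets import load_dataset

    return load_dataset("mhla/pre1900-corpus", split=split, streaming=True)


# Toy rows standing in for corpus records (values are made up).
sample = [{"year": 1843}, {"year": 1849}, {"year": 1899}]
print(decade_counts(sample))
```

With network access, `decade_counts(stream_corpus().take(1000))` would tally the first 1,000 streamed documents.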
## Related
- [mhla/gpt1900-d34-22btok](https://huggingface.co/mhla/gpt1900-d34-22btok) — GPT-1900 base model trained on this corpus
- [mhla/gpt1900-physics-clm](https://huggingface.co/datasets/mhla/gpt1900-physics-clm) — Physics texts for continued pretraining
- [mhla/gpt1900-instruct-v3-data](https://huggingface.co/datasets/mhla/gpt1900-instruct-v3-data) — Instruction-tuning data
Provider:
mhla



