
mhla/pre1900-corpus

Source: Hugging Face. Updated 2026-03-29. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/mhla/pre1900-corpus
Description:
---
license: mit
language:
- en
dataset_info:
  features:
  - name: text
    dtype: string
  - name: year
    dtype: int64
  - name: title
    dtype: string
  - name: source
    dtype: string
  - name: ocr_score
    dtype: float64
  - name: legibility
    dtype: float64
tags:
- pre-1900
- historical
- physics
- nlp
---

# Pre-1900 Corpus

The training corpus for [GPT-1900](https://huggingface.co/mhla/gpt1900-d34-22btok), a cleaned collection of pre-1900 English-language texts with full metadata. Every document in this corpus was published before the year 1900.

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `text` | string | Full document text |
| `year` | int64 | Publication year |
| `title` | string | Book title or newspaper name |
| `source` | string | Source dataset identifier |
| `ocr_score` | float64 | OCR confidence score (-1.0 if unavailable) |
| `legibility` | float64 | Legibility score (-1.0 if unavailable) |

## Sources

- **Institutional books**: HathiTrust, Internet Archive, and other digitized book collections
- **British Library books**: TheBritishLibrary/blbooks
- **Historical newspapers**: dell-research-harvard/AmericanStories

## Filtering Pipeline

1. **OCR cleanup**: removal of OCR artifacts and boilerplate, plus Unicode normalization
2. **Quality filtering**: token frequency prior-based filtering as a cheap proxy for perplexity
3. **Anachronism detection**: three-tier post-1900 physics filter to remove mislabeled modern texts:
   - *Always reject*: unambiguous post-1900 terms (photon, spacetime, transistor, etc.)
   - *Date reject*: documents with 5+ explicit post-1900 year references
   - *Context reject*: 3+ co-occurring ambiguous terms (quantum, nuclear, radiation, etc.)
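The three-tier anachronism check described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used to build the corpus: the term lists contain only the examples named in this card, the function name `anachronism_verdict` is hypothetical, and two details are assumptions (a "post-1900 year reference" is taken to be any four-digit number from 1900 to 2099, and "co-occurring" is taken to mean distinct ambiguous terms appearing in the same document).

```python
import re

# Hypothetical term lists: the card gives only these examples for each tier.
ALWAYS_REJECT = {"photon", "spacetime", "transistor"}
AMBIGUOUS = {"quantum", "nuclear", "radiation"}

# Assumption: any standalone four-digit number in 1900-2099 counts as a year reference.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")
WORD_RE = re.compile(r"[a-z]+")


def anachronism_verdict(text, min_year_refs=5, min_ambiguous=3):
    """Return 'always_reject', 'date_reject', 'context_reject', or 'keep'."""
    words = set(WORD_RE.findall(text.lower()))

    # Tier 1: a single unambiguous post-1900 term is enough to reject.
    if words & ALWAYS_REJECT:
        return "always_reject"

    # Tier 2: count explicit year references at or after 1900.
    if len(YEAR_RE.findall(text)) >= min_year_refs:
        return "date_reject"

    # Tier 3: count distinct ambiguous terms present in the document.
    if len(words & AMBIGUOUS) >= min_ambiguous:
        return "context_reject"

    return "keep"


if __name__ == "__main__":
    print(anachronism_verdict("A treatise on heat and radiation, 1872."))
    print(anachronism_verdict("The photon was first described later."))
```

Exact word matching keeps the sketch short but misses plurals and hyphenated forms, and the year pattern will also match quantities such as "2000 pounds"; a production filter would need broader term lists and more careful matching.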
## Usage

```python
from datasets import load_dataset

ds = load_dataset("mhla/pre1900-corpus")
```

## Related

- [mhla/gpt1900-d34-22btok](https://huggingface.co/mhla/gpt1900-d34-22btok): GPT-1900 base model trained on this corpus
- [mhla/gpt1900-physics-clm](https://huggingface.co/datasets/mhla/gpt1900-physics-clm): Physics texts for continued pretraining
- [mhla/gpt1900-instruct-v3-data](https://huggingface.co/datasets/mhla/gpt1900-instruct-v3-data): Instruction-tuning data
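One practical note on the schema: because `ocr_score` and `legibility` use -1.0 as an "unavailable" marker, a naive threshold such as `ocr_score >= 0.8` would silently drop every document without a score. The sketch below shows one way to handle the sentinel explicitly. The function name `keep_row`, the thresholds, and the `keep_missing` switch are all hypothetical; the card does not state the scale of either score, so any cutoff should be chosen after inspecting the data.

```python
SENTINEL = -1.0  # per the schema: the score is unavailable


def score_ok(score, minimum, keep_missing=True):
    """Check a score against a minimum, handling the -1.0 sentinel explicitly."""
    if score == SENTINEL:
        return keep_missing
    return score >= minimum


def keep_row(row, min_ocr=0.0, min_legibility=0.0, keep_missing=True):
    """Decide whether a row (a dict with the schema's column names) passes.

    The thresholds are illustrative assumptions, not values from the card.
    """
    return (
        score_ok(row["ocr_score"], min_ocr, keep_missing)
        and score_ok(row["legibility"], min_legibility, keep_missing)
    )


if __name__ == "__main__":
    sample = {"year": 1850, "ocr_score": 0.9, "legibility": -1.0}
    print(keep_row(sample, min_ocr=0.8))                      # sentinel tolerated
    print(keep_row(sample, min_ocr=0.8, keep_missing=False))  # sentinel rejected
```

With the `datasets` library, a predicate like this could be passed to `Dataset.filter`, which calls it once per example with a dict of column values.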
Provider:
mhla