sol-r/historica-corpus
Source: Hugging Face. Updated 2026-03-24. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/sol-r/historica-corpus
Description:
---
language:
- la
- grc
- en
- enm
- non
- cop
- ang
- de
- got
- cu
- xcl
- nb
- sv
- fr
- da
license: cc-by-sa-4.0
task_categories:
- text-generation
tags:
- ancient-languages
- historical-linguistics
- latin
- ancient-greek
- middle-english
- old-norse
- coptic
- gothic
pretty_name: Historica Corpus v2
size_categories:
- 100K<n<1M
---
# Historica Corpus v2
A monolingual pretraining corpus of **314,438 passages** (177M words, 1.1B characters) spanning 15 ancient and historical languages, from 500 BCE to 1750 CE.
## What's New in v2
- **Roughly 2x as many passages** (314k vs 159k), a result of varied-length chunking
- **SGML entity resolution**: Middle English þ/ȝ/ð properly rendered (928k entities fixed)
- **PROIEL punctuation**: reconstructed from `presentation-after` attributes
- **TEI apparatus handling**: `<lem>` (main reading) preserved, `<rdg>` (variants) skipped, `<supplied>` kept
- **OCR artifact cleaning**: line-break hyphenation, column numbers stripped
- **Varied chunk lengths**: 11% short (100-200w), 24% medium (200-400w), 32% mid (400-700w), 23% long (700-1000w), 10% full (1000-1200w)
- **Metadata fixes**: saga language tag `se` now maps to Swedish (not Sami), Coptic is no longer hardcoded as Christian, and Old English genre is no longer hardcoded as poetry
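The PROIEL punctuation step in the list above can be illustrated with a minimal sketch. It is not the code from `extract_corpus.py`; it assumes PROIEL-style `<token>` elements that carry a `form` attribute plus optional `presentation-before` and `presentation-after` attributes, and the sample fragment and the function name `reconstruct_sentence` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical PROIEL-style fragment; real treebank files carry many more attributes.
SAMPLE = """
<sentence>
  <token form="In" presentation-after=" "/>
  <token form="principio" presentation-after=" "/>
  <token form="erat" presentation-after=" "/>
  <token form="Verbum" presentation-after=", "/>
  <token form="et" presentation-after=" "/>
  <token form="Verbum" presentation-after=" "/>
  <token form="erat" presentation-after=" "/>
  <token form="apud" presentation-after=" "/>
  <token form="Deum" presentation-after=". "/>
  <token empty-token-sort="P"/>
</sentence>
"""


def reconstruct_sentence(sentence: ET.Element) -> str:
    """Rebuild running text from tokens and their presentation attributes."""
    parts = []
    for token in sentence.iter("token"):
        form = token.get("form")
        if form is None:
            # Tokens without a surface form contribute no text.
            continue
        parts.append(token.get("presentation-before", ""))
        parts.append(form)
        parts.append(token.get("presentation-after", ""))
    return "".join(parts).strip()


text = reconstruct_sentence(ET.fromstring(SAMPLE.strip()))
print(text)
```

Joining on the stored presentation strings, rather than inserting a space between tokens, is what keeps commas and full stops attached to the preceding word.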
## Languages
| Language | Code | Passages | Words | Sources |
|----------|------|----------|-------|---------|
| Latin | lat | 196,632 | 125M | PL, CSEL, CAMENA, First1KGreek, Tesserae, Latin Library, CroALa, Corpus Iuris |
| Ancient Greek | grc | 42,324 | 32M | First1KGreek, PROIEL |
| English | eng | 39,261 | 15M | english_trans, Corpus Iuris |
| Middle English | enm | 27,868 | 4M | Michigan ME Corpus |
| Old Norse | non | 3,316 | 0.6M | SagaDB, CLTK, Heimskringla |
| Coptic | cop | 3,105 | 0.4M | Coptic Scriptorium |
| German | deu | 456 | — | First1KGreek translations |
| Old English | ang | 373 | — | OE Sacred, OEDT |
| Norwegian (Bokmål) | nob | 352 | — | SagaDB translations |
| Swedish | swe | 219 | — | SagaDB translations |
| French | fra | 206 | — | SagaDB translations |
| Gothic | got | 134 | — | PROIEL (Wulfila Bible) |
| Old Church Slavonic | chu | 85 | — | PROIEL (Codex Marianus) |
| Danish | dan | 74 | — | SagaDB translations |
| Classical Armenian | xcl | 33 | — | PROIEL |
## Sources
| Source | Passages | Description |
|--------|----------|-------------|
| CAMENA | 84,036 | Neo-Latin literature 1500-1750 (letters, history, poetry, encyclopedias) |
| Patrologia Latina | 60,403 | Church fathers (Latin, 4th-13th c.) |
| First1KGreek | 44,884 | Greek literature 700 BCE-900 CE |
| english_trans | 29,889 | English translations of classical texts (long-s OCR corrected) |
| Middle English | 27,868 | Michigan corpus (SGML entities resolved to Unicode) |
| Latin Library | 24,952 | Classical and medieval Latin |
| Tesserae | 10,896 | Classical Latin (intertextuality project) |
| CSEL | 10,556 | Church fathers (critical editions) |
| Corpus Iuris | 9,490 | Roman law (Latin + English) |
| SagaDB | 3,862 | Old Norse sagas + translations |
| Coptic Scriptorium | 3,102 | Coptic texts |
| Old Norse CLTK | 1,631 | Old Norse poetry + prose |
| PROIEL | 1,505 | Parallel treebank (with reconstructed punctuation) |
| CroALa | 919 | Croatian Latin |
| Others | 1,340 | OE Sacred, OEDT, Heimskringla |
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `source` | string | Source repository/collection |
| `language` | string | ISO 639-3 code |
| `author` | string | Author (where known) |
| `work` | string | Work title |
| `genre` | string | Genre (poetry, history, law, etc.) |
| `tradition` | string | Tradition (christian, secular, norse_pagan) |
| `urn` | string | CTS/URN identifier (where available) |
| `id` | string | Unique passage identifier |
| `text` | string | Passage text (cleaned, entity-resolved) |
| `word_count` | int | Word count |
| `char_count` | int | Character count |
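As a rough illustration of how the columns above might be used, the sketch below types one row as a `TypedDict` and filters rows by language, length, and tradition. The helper names `make_row` and `select_passages` and the three sample rows are hypothetical, and counting words by whitespace splitting is an assumption, not necessarily how `word_count` was computed for the corpus.

```python
from typing import Iterable, Iterator, Optional, TypedDict


class Passage(TypedDict):
    """One corpus row, mirroring the schema table."""
    source: str
    language: str
    author: str
    work: str
    genre: str
    tradition: str
    urn: str
    id: str
    text: str
    word_count: int
    char_count: int


def make_row(id_: str, language: str, tradition: str, text: str) -> Passage:
    """Build an invented sample row; counts assume whitespace tokenization."""
    return Passage(
        source="example", language=language, author="", work="", genre="",
        tradition=tradition, urn="", id=id_, text=text,
        word_count=len(text.split()), char_count=len(text),
    )


def select_passages(
    rows: Iterable[Passage],
    language: str,
    min_words: int = 0,
    tradition: Optional[str] = None,
) -> Iterator[Passage]:
    """Yield rows matching a language code, a minimum length, and optionally a tradition."""
    for row in rows:
        if row["language"] != language:
            continue
        if row["word_count"] < min_words:
            continue
        if tradition is not None and row["tradition"] != tradition:
            continue
        yield row


rows = [
    make_row("ex-1", "lat", "secular", "arma virumque cano"),
    make_row("ex-2", "lat", "christian", "in principio erat verbum"),
    make_row("ex-3", "grc", "secular", "μῆνιν ἄειδε θεά"),
]
latin_ids = [r["id"] for r in select_passages(rows, "lat")]
print(latin_ids)
```

Because `select_passages` only relies on dictionary access, it should also work on rows loaded from the published files, provided the columns arrive under the names listed in the table.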
## Extraction
Built with `extract_corpus.py` using a parser class architecture:
- `TEIParser` — CSEL, PL, CAMENA, First1KGreek, CroALa, Coptic (with proper `<lem>`/`<supplied>` handling)
- `EnglishTransParser` — TEI + long-s OCR correction
- `ProielParser` — punctuation reconstructed from `presentation-after`
- `SagaParser` — entity resolution, corrected language codes
- `MiddleEnglishParser` — 203 SGML entities resolved to Unicode
- `PlaintextParser`, `TesseraeParser`, `OldEnglishParser`
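The `<lem>`/`<rdg>`/`<supplied>` handling attributed to `TEIParser` can be sketched as follows. This is an illustrative approximation, not the parser itself: it walks the tree recursively, skips the content of `<rdg>` while keeping everything else, and the names `extract_text`, `local_name`, and `SKIPPED_TAGS`, along with the sample paragraph and its variant reading, are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Elements whose content is dropped; <lem> and <supplied> fall through and are kept.
SKIPPED_TAGS = {"rdg"}


def local_name(tag: str) -> str:
    """Strip an ElementTree namespace prefix such as '{http://...}rdg'."""
    return tag.rsplit("}", 1)[-1]


def extract_text(elem: ET.Element) -> str:
    """Collect text recursively, omitting the content of skipped elements."""
    if local_name(elem.tag) in SKIPPED_TAGS:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(extract_text(child))
        # The tail belongs to the parent's text flow, so it is kept
        # even when the child itself was skipped.
        parts.append(child.tail or "")
    return "".join(parts)


# Hypothetical TEI fragment with one apparatus entry and one supplied character.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">arma '
    "<app><lem>virumque</lem><rdg>virumqve</rdg></app>"
    " cano<supplied>,</supplied> Troiae</p>"
)

# Collapse whitespace runs left behind by the markup.
passage = " ".join(extract_text(ET.fromstring(SAMPLE)).split())
print(passage)
```

Appending each child's tail in the parent loop matters: text that follows a skipped `<rdg>` is part of the main reading and would be lost if the tail were dropped along with the element.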
Provider:
sol-r



