sol-r/historica-corpus

Name: sol-r/historica-corpus
Creator: sol-r
Published: 2026-03-24 00:31:26
License: no description provided

Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/sol-r/historica-corpus


Description:

---
language:
- la
- grc
- en
- enm
- non
- cop
- ang
- de
- got
- cu
- xcl
- nb
- sv
- fr
- da
license: cc-by-sa-4.0
task_categories:
- text-generation
tags:
- ancient-languages
- historical-linguistics
- latin
- ancient-greek
- middle-english
- old-norse
- coptic
- gothic
pretty_name: Historica Corpus v2
size_categories:
- 100K<n<1M
---

# Historica Corpus v2

A monolingual pretraining corpus of **314,438 passages** (177M words, 1.1B characters) spanning 15 ancient and historical languages, from 500 BCE to 1750 CE.

## What's New in v2

- **2x more passages** (314k vs 159k) due to varied-length chunking
- **SGML entity resolution**: Middle English þ/ȝ/ð properly rendered (928k entities fixed)
- **PROIEL punctuation**: reconstructed from `presentation-after` attributes
- **TEI apparatus handling**: `<lem>` (main reading) preserved, `<rdg>` (variants) skipped, `<supplied>` kept
- **OCR artifact cleaning**: line-break hyphenation, column numbers stripped
- **Varied chunk lengths**: 11% short (100-200w), 24% medium (200-400w), 32% mid (400-700w), 23% long (700-1000w), 10% full (1000-1200w)
- **Metadata fixes**: saga `se` → Swedish (not Sami), Coptic not hardcoded as Christian, OE genre not hardcoded as poetry

## Languages

| Language | Code | Passages | Words | Sources |
|----------|------|----------|-------|---------|
| Latin | lat | 196,632 | 125M | PL, CSEL, CAMENA, First1KGreek, Tesserae, Latin Library, CroALa, Corpus Iuris |
| Ancient Greek | grc | 42,324 | 32M | First1KGreek, PROIEL |
| English | eng | 39,261 | 15M | english_trans, Corpus Iuris |
| Middle English | enm | 27,868 | 4M | Michigan ME Corpus |
| Old Norse | non | 3,316 | 0.6M | SagaDB, CLTK, Heimskringla |
| Coptic | cop | 3,105 | 0.4M | Coptic Scriptorium |
| German | deu | 456 | — | First1KGreek translations |
| Old English | ang | 373 | — | OE Sacred, OEDT |
| Norwegian (Bokmål) | nob | 352 | — | SagaDB translations |
| Swedish | swe | 219 | — | SagaDB translations |
| French | fra | 206 | — | SagaDB translations |
| Gothic | got | 134 | — | PROIEL (Wulfila Bible) |
| Old Church Slavonic | chu | 85 | — | PROIEL (Codex Marianus) |
| Danish | dan | 74 | — | SagaDB translations |
| Classical Armenian | xcl | 33 | — | PROIEL |

## Sources

| Source | Passages | Description |
|--------|----------|-------------|
| CAMENA | 84,036 | Neo-Latin literature 1500-1750 (letters, history, poetry, encyclopedias) |
| Patrologia Latina | 60,403 | Church fathers (Latin, 4th-13th c.) |
| First1KGreek | 44,884 | Greek literature 700 BCE-900 CE |
| english_trans | 29,889 | English translations of classical texts (long-s OCR corrected) |
| Middle English | 27,868 | Michigan corpus (SGML entities resolved to Unicode) |
| Latin Library | 24,952 | Classical and medieval Latin |
| Tesserae | 10,896 | Classical Latin (intertextuality project) |
| CSEL | 10,556 | Church fathers (critical editions) |
| Corpus Iuris | 9,490 | Roman law (Latin + English) |
| SagaDB | 3,862 | Old Norse sagas + translations |
| Coptic Scriptorium | 3,102 | Coptic texts |
| Old Norse CLTK | 1,631 | Old Norse poetry + prose |
| PROIEL | 1,505 | Parallel treebank (with reconstructed punctuation) |
| CroALa | 919 | Croatian Latin |
| Others | 1,340 | OE Sacred, OEDT, Heimskringla |

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `source` | string | Source repository/collection |
| `language` | string | ISO 639-3 code |
| `author` | string | Author (where known) |
| `work` | string | Work title |
| `genre` | string | Genre (poetry, history, law, etc.) |
| `tradition` | string | Tradition (christian, secular, norse_pagan) |
| `urn` | string | CTS/URN identifier (where available) |
| `id` | string | Unique passage identifier |
| `text` | string | Passage text (cleaned, entity-resolved) |
| `word_count` | int | Word count |
| `char_count` | int | Character count |

## Extraction

Built with `extract_corpus.py` using a parser class architecture:

- `TEIParser`: CSEL, PL, CAMENA, First1KGreek, CroALa, Coptic (with proper `<lem>`/`<supplied>` handling)
- `EnglishTransParser`: TEI + long-s OCR correction
- `ProielParser`: punctuation reconstructed from `presentation-after`
- `SagaParser`: entity resolution, corrected language codes
- `MiddleEnglishParser`: 203 SGML entities resolved to Unicode
- `PlaintextParser`, `TesseraeParser`, `OldEnglishParser`
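The TEI apparatus rule described in this card (keep `<lem>` and `<supplied>`, skip `<rdg>`) could be sketched roughly as below. This is an illustration using only the Python standard library, not the actual code of `extract_corpus.py` or `TEIParser`. The function names `local_name`, `extract_text`, and `clean_passage` and the sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: only variant readings are dropped. <lem> and <supplied> are
# ordinary elements here, so their text falls through and is kept.
SKIPPED_TAGS = {"rdg"}


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://www.tei-c.org/ns/1.0}'."""
    return tag.rsplit("}", 1)[-1]


def extract_text(element: ET.Element) -> str:
    """Collect text depth-first, omitting everything inside skipped elements."""
    parts = []
    if local_name(element.tag) not in SKIPPED_TAGS:
        parts.append(element.text or "")
        for child in element:
            parts.append(extract_text(child))
            # A child's tail belongs to the parent's text flow, so it is kept
            # even when the child itself was skipped.
            parts.append(child.tail or "")
    return "".join(parts)


def clean_passage(xml_fragment: str) -> str:
    """Parse a fragment and return its reading text with whitespace collapsed."""
    root = ET.fromstring(xml_fragment)
    return " ".join(extract_text(root).split())


# Hypothetical sample fragment, written for this example.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">Arma '
    '<app><lem>virumque</lem><rdg wit="B">virumqve</rdg></app> cano '
    "<supplied>Troiae</supplied> qui primus</p>"
)

print(clean_passage(SAMPLE))  # Arma virumque cano Troiae qui primus
```

Handling the tail text separately matters because `ElementTree` attaches the text that follows an element to that element, so dropping a skipped `<rdg>` wholesale would also lose the words after it.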
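The PROIEL step mentioned in this card, rebuilding punctuation from `presentation-after`, might look something like the following. It is a minimal sketch that assumes PROIEL-style `<token>` elements with `form`, `presentation-before`, and `presentation-after` attributes. The function name `sentence_text` and the sample sentence are made up for illustration, and the real `ProielParser` may work differently.

```python
import xml.etree.ElementTree as ET


def sentence_text(sentence: ET.Element) -> str:
    """Rebuild running text from token forms plus their presentation strings."""
    parts = []
    for token in sentence.iter("token"):
        form = token.get("form")
        if form is None:
            # Tokens without a surface form (e.g. empty syntactic nodes)
            # contribute nothing to the printed text.
            continue
        parts.append(token.get("presentation-before", ""))
        parts.append(form)
        parts.append(token.get("presentation-after", ""))
    return "".join(parts).strip()


# Hypothetical sample in the assumed token layout.
SAMPLE_SENTENCE = """
<sentence id="1">
  <token id="1" form="Gallia" presentation-after=" "/>
  <token id="2" form="est" presentation-after=" "/>
  <token id="3" form="omnis" presentation-after=" "/>
  <token id="4" form="divisa" presentation-after=". "/>
  <token id="5" empty-token-sort="V"/>
</sentence>
"""

print(sentence_text(ET.fromstring(SAMPLE_SENTENCE)))  # Gallia est omnis divisa.
```

Joining on the stored presentation strings, rather than on a single space, is what lets commas and full stops reappear without a gap before them.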
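As a usage illustration for the schema in this card, the sketch below filters rows by `language`, `word_count`, and `tradition`. It works on plain dictionaries so that it runs without any download. In practice the rows might come from something like `datasets.load_dataset("sol-r/historica-corpus", split="train")`, where the split name is an assumption. The helper `select_passages` and the sample rows are hypothetical and do not reproduce real corpus records.

```python
from collections import Counter
from typing import Iterable, Iterator, Optional


def select_passages(
    rows: Iterable[dict],
    language: str,
    min_words: int = 0,
    tradition: Optional[str] = None,
) -> Iterator[dict]:
    """Yield rows matching an ISO 639-3 code, a minimum length, and a tradition."""
    for row in rows:
        if row["language"] != language:
            continue
        if row["word_count"] < min_words:
            continue
        if tradition is not None and row["tradition"] != tradition:
            continue
        yield row


# Made-up rows for illustration only; they carry just the columns used here.
SAMPLE_ROWS = [
    {"id": "demo-1", "language": "lat", "tradition": "christian", "word_count": 640},
    {"id": "demo-2", "language": "lat", "tradition": "secular", "word_count": 150},
    {"id": "demo-3", "language": "grc", "tradition": "secular", "word_count": 900},
    {"id": "demo-4", "language": "lat", "tradition": "secular", "word_count": 1100},
]

long_latin = list(select_passages(SAMPLE_ROWS, "lat", min_words=200))
print([row["id"] for row in long_latin])  # ['demo-1', 'demo-4']

# Per-language passage counts, comparable to the Languages table.
print(Counter(row["language"] for row in SAMPLE_ROWS))
```

Because Latin accounts for well over half of the passages, a per-language filter of this kind is a natural first step before sampling a balanced training mix.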

Provided by:

sol-r
