
sol-r/historica-pairs

Source: Hugging Face. Updated 2026-03-24. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/sol-r/historica-pairs
Description:
---
language:
- grc
- la
- he
- cop
- ang
- non
- got
- cu
- xcl
- en
license: cc-by-sa-4.0
task_categories:
- translation
tags:
- ancient-languages
- parallel-text
- cross-lingual
- bilingual
pretty_name: Historica Pairs v2
size_categories:
- 100K<n<1M
---

# Historica Pairs v2

**100,379 aligned text pairs** across 30 language directions, covering ancient Greek, Latin, Hebrew, Coptic, Gothic, Old Church Slavonic, Armenian, Old English, and Old Norse.

## What's New in v2

- **New schema**: `text_a`/`text_b`/`language_a`/`language_b` (replaces misleading `original`/`english`)
- **PROIEL punctuation**: reconstructed from `presentation-after` attributes (was depunctuated)
- **First1KGreek alignment fixed**: `n`-attribute matching instead of positional (eliminates misaligned translations)
- **Length ratio guard**: pairs with >5:1 length mismatch are dropped
- **Entity resolution**: saga and edda pairs have entities resolved
- **Language codes fixed**: saga Swedish correctly tagged as `swe` (was `sme`)

## Language Pairs

| Direction | Pairs | Source |
|-----------|------:|--------|
| grc↔eng | 34,612 | Perseus, Bible, First1KGreek |
| lat↔eng | 29,520 | Perseus, Bible, Corpus Iuris |
| heb↔eng | 14,710 | Bible |
| cop↔eng | 4,757 | Coptic Bible |
| cop↔lat | 4,756 | Coptic Bible |
| cop↔grc | 4,752 | Coptic Bible |
| ang↔eng | 2,632 | OEDT |
| non↔eng | 2,327 | SagaDB, Poetic Edda |
| grc↔lat | 418 | First1KGreek |
| non↔nob | 350 | SagaDB |
| non↔swe | 220 | SagaDB |
| non↔fra | 165 | SagaDB |
| non↔deu | 150 | SagaDB |
| got↔grc | 130 | PROIEL |
| got↔lat | 119 | PROIEL |
| grc↔chu | 85 | PROIEL |
| non↔dan | 75 | SagaDB |
| got↔chu | 56 | PROIEL |
| xcl↔grc/lat/chu/got | 93 | PROIEL |

## Sources

| Source | Pairs | Description |
|--------|------:|-------------|
| Bible | 54,345 | Hebrew + Greek + Latin + English (verse-aligned) |
| Perseus | 15,918 | Classical Greek + Latin ↔ English |
| Coptic Bible | 14,265 | Bohairic NT ↔ Greek/Latin/English |
| First1KGreek | 7,417 | Greek ↔ English/Latin (section-aligned by `n` attribute) |
| OEDT | 2,632 | Old English ↔ Modern English (sentence-aligned) |
| SagaDB | 2,081 | Old Norse ↔ English/German/French/Swedish/Norwegian/Danish |
| Poetic Edda | 1,573 | Old Norse ↔ English (stanza-aligned) |
| Corpus Iuris | 1,366 | Roman law Latin ↔ English (section-aligned) |
| PROIEL | 782 | 5-way parallel NT: Greek ↔ Gothic ↔ Latin ↔ OCS ↔ Armenian |

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Unique pair identifier |
| `source` | string | Source collection |
| `language_a` | string | ISO 639-3 code for text_a |
| `language_b` | string | ISO 639-3 code for text_b |
| `text_a` | string | Source text |
| `text_b` | string | Target text |
| `author` | string | Author (where known) |
| `work` | string | Work title |
| `ref` | string | Internal reference (verse, chapter, section) |
| `urn` | string | CTS/URN identifier |
| `genre` | string | Genre |
| `tradition` | string | Tradition |
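
The schema and the length ratio guard described above can be illustrated with a minimal sketch. It is not the dataset's actual build code. It assumes length is measured in characters (the card does not say whether the 5:1 guard uses characters, words, or tokens). The helper names (`length_ratio`, `passes_length_guard`, `in_direction`, `select_pairs`) and the rows in `DEMO_ROWS` are hypothetical and exist only for illustration. Only the four column names `language_a`, `language_b`, `text_a`, and `text_b` come from the card.

```python
from typing import Dict, Iterable, Iterator

Row = Dict[str, str]


def length_ratio(text_a: str, text_b: str) -> float:
    """Return longer length / shorter length, counted in characters.

    Counting characters is an assumption; the card does not state the unit.
    """
    len_a, len_b = len(text_a.strip()), len(text_b.strip())
    shorter, longer = min(len_a, len_b), max(len_a, len_b)
    if shorter == 0:
        # An empty side can never satisfy a finite ratio limit.
        return float("inf")
    return longer / shorter


def passes_length_guard(row: Row, max_ratio: float = 5.0) -> bool:
    """True if the pair is within the max_ratio:1 length mismatch limit."""
    return length_ratio(row["text_a"], row["text_b"]) <= max_ratio


def in_direction(row: Row, lang_x: str, lang_y: str) -> bool:
    """True if the row pairs lang_x with lang_y, in either column order."""
    return {row["language_a"], row["language_b"]} == {lang_x, lang_y}


def select_pairs(
    rows: Iterable[Row], lang_x: str, lang_y: str, max_ratio: float = 5.0
) -> Iterator[Row]:
    """Yield rows for one language direction that also pass the length guard."""
    for row in rows:
        if in_direction(row, lang_x, lang_y) and passes_length_guard(row, max_ratio):
            yield row


# Hypothetical rows shaped like the schema (ids are made up for this demo).
# "demo-2" is deliberately mismatched to trip the length guard.
DEMO_ROWS = [
    {"id": "demo-1", "language_a": "lat", "language_b": "eng",
     "text_a": "In principio erat Verbum",
     "text_b": "In the beginning was the Word"},
    {"id": "demo-2", "language_a": "eng", "language_b": "lat",
     "text_a": "Amen",
     "text_b": "Et Verbum caro factum est et habitavit in nobis"},
    {"id": "demo-3", "language_a": "grc", "language_b": "eng",
     "text_a": "Ἐν ἀρχῇ ἦν ὁ λόγος",
     "text_b": "In the beginning was the Word"},
]

if __name__ == "__main__":
    for row in select_pairs(DEMO_ROWS, "lat", "eng"):
        print(row["id"], "|", row["text_a"], "|", row["text_b"])
```

Because a direction such as lat↔eng may appear with the languages in either column, the sketch compares the two codes as an unordered set. The same functions should work on rows loaded with the Hugging Face `datasets` library (for example via `load_dataset("sol-r/historica-pairs")`), since those rows iterate as dictionaries. The card does not list split names, so inspect the returned object before selecting one.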
Provider:
sol-r