sol-r/historica-pairs
Hugging Face · Updated 2026-03-24 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/sol-r/historica-pairs
Resource description:
---
language:
- grc
- la
- he
- cop
- ang
- non
- got
- cu
- xcl
- en
license: cc-by-sa-4.0
task_categories:
- translation
tags:
- ancient-languages
- parallel-text
- cross-lingual
- bilingual
pretty_name: Historica Pairs v2
size_categories:
- 100K<n<1M
---
# Historica Pairs v2
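**100,379 aligned text pairs** across 30 language directions, covering Ancient Greek, Latin, Hebrew, Coptic, Gothic, Old Church Slavonic, Classical Armenian, Old English, and Old Norse. Texts are paired with English, with one another, or (for Old Norse) with modern European languages.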
## What's New in v2
- **New schema**: `text_a`/`text_b`/`language_a`/`language_b` (replaces the misleading `original`/`english` columns)
- **PROIEL punctuation**: reconstructed from `presentation-after` attributes (the text was previously depunctuated)
- **First1KGreek alignment fixed**: sections are matched by `n` attribute instead of by position, which eliminates misaligned translations
- **Length ratio guard**: pairs with >5:1 length mismatch are dropped
- **Entity resolution**: saga and edda pairs have entities resolved
- **Language codes fixed**: saga Swedish correctly tagged as `swe` (was `sme`)
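
The length ratio guard listed above can be sketched as follows. This is a minimal illustration, not the dataset's actual filtering code: the card does not say whether length is measured in characters, tokens, or bytes, so character counts after stripping whitespace are an assumption here, and the function names are hypothetical.

```python
def length_ratio(text_a: str, text_b: str) -> float:
    """Return the longer-to-shorter character-length ratio (inf if a side is empty)."""
    # Assumption: length is counted in characters after stripping whitespace.
    len_a, len_b = len(text_a.strip()), len(text_b.strip())
    shorter, longer = sorted((len_a, len_b))
    if shorter == 0:
        return float("inf")
    return longer / shorter


def passes_length_guard(text_a: str, text_b: str, max_ratio: float = 5.0) -> bool:
    """Keep a pair unless its length mismatch exceeds max_ratio (5:1 by default)."""
    return length_ratio(text_a, text_b) <= max_ratio


# Toy pairs, invented for illustration only.
pairs = [
    ("arma virumque cano", "I sing of arms and the man"),
    ("et", "and then, after a very long digression, he finally spoke"),
]
kept = [pair for pair in pairs if passes_length_guard(*pair)]
```

With these toy inputs only the first pair survives; the second is dropped because one side is far more than five times longer than the other.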
## Language Pairs
| Direction | Pairs | Source |
|-----------|------:|--------|
| grc↔eng | 34,612 | Perseus, Bible, First1KGreek |
| lat↔eng | 29,520 | Perseus, Bible, Corpus Iuris |
| heb↔eng | 14,710 | Bible |
| cop↔eng | 4,757 | Coptic Bible |
| cop↔lat | 4,756 | Coptic Bible |
| cop↔grc | 4,752 | Coptic Bible |
| ang↔eng | 2,632 | OEDT |
| non↔eng | 2,327 | SagaDB, Poetic Edda |
| grc↔lat | 418 | First1KGreek |
| non↔nob | 350 | SagaDB |
| non↔swe | 220 | SagaDB |
| non↔fra | 165 | SagaDB |
| non↔deu | 150 | SagaDB |
| got↔grc | 130 | PROIEL |
| got↔lat | 119 | PROIEL |
| grc↔chu | 85 | PROIEL |
| non↔dan | 75 | SagaDB |
| got↔chu | 56 | PROIEL |
| xcl↔grc/lat/chu/got | 93 | PROIEL |
## Sources
| Source | Pairs | Description |
|--------|------:|-------------|
| Bible | 54,345 | Hebrew + Greek + Latin + English (verse-aligned) |
| Perseus | 15,918 | Classical Greek + Latin ↔ English |
| Coptic Bible | 14,265 | Bohairic NT ↔ Greek/Latin/English |
| First1KGreek | 7,417 | Greek ↔ English/Latin (section-aligned by `n` attribute) |
| OEDT | 2,632 | Old English ↔ Modern English (sentence-aligned) |
| SagaDB | 2,081 | Old Norse ↔ English/German/French/Swedish/Norwegian/Danish |
| Poetic Edda | 1,573 | Old Norse ↔ English (stanza-aligned) |
| Corpus Iuris | 1,366 | Roman law Latin ↔ English (section-aligned) |
| PROIEL | 782 | 5-way parallel NT: Greek ↔ Gothic ↔ Latin ↔ OCS ↔ Armenian |
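
To illustrate why matching sections by their `n` attribute is safer than pairing them by position, here is a small sketch. It assumes flat, namespace-free `<div n="...">` elements and uses hypothetical helper names and placeholder text; real First1KGreek files are namespaced TEI with nested structure, so this is not the pipeline's actual code.

```python
import xml.etree.ElementTree as ET


def sections_by_n(xml_text: str, tag: str = "div") -> dict[str, str]:
    """Map each element's `n` attribute to its whitespace-normalised text."""
    root = ET.fromstring(xml_text)
    sections = {}
    for elem in root.iter(tag):
        n = elem.get("n")
        if n is not None:
            sections[n] = " ".join("".join(elem.itertext()).split())
    return sections


def align_by_n(source_xml: str, target_xml: str) -> list[tuple[str, str, str]]:
    """Pair sections that share an `n` value; unmatched sections are skipped."""
    source = sections_by_n(source_xml)
    target = sections_by_n(target_xml)
    return [(n, source[n], target[n]) for n in source if n in target]


# Placeholder documents: the translation lacks section 2.
source_doc = '<body><div n="1">alpha</div><div n="2">beta</div><div n="3">gamma</div></body>'
target_doc = '<body><div n="1">A</div><div n="3">C</div></body>'

aligned = align_by_n(source_doc, target_doc)
```

Pairing by position would match source section 2 with target section 3 here. Matching on `n` instead skips the section that has no counterpart.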
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `id` | string | Unique pair identifier |
| `source` | string | Source collection |
| `language_a` | string | ISO 639-3 code for `text_a` |
| `language_b` | string | ISO 639-3 code for `text_b` |
| `text_a` | string | Text in `language_a` |
| `text_b` | string | Text in `language_b` |
| `author` | string | Author (where known) |
| `work` | string | Work title |
| `ref` | string | Internal reference (verse, chapter, section) |
| `urn` | string | CTS/URN identifier |
| `genre` | string | Genre |
| `tradition` | string | Tradition |
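
A short sketch of how the columns above might be used to pull out one language direction. The rows are plain dictionaries with invented values. In practice they could come from `datasets.load_dataset("sol-r/historica-pairs")`, though the card does not state split or configuration names. The sketch also assumes a direction such as got↔grc may be stored in either order, which the card does not confirm, and `select_direction` is a hypothetical helper.

```python
from typing import Iterable, Iterator


def select_direction(
    rows: Iterable[dict], lang_x: str, lang_y: str
) -> Iterator[tuple[str, str]]:
    """Yield (text in lang_x, text in lang_y), whichever side each is stored on."""
    for row in rows:
        stored = (row["language_a"], row["language_b"])
        if stored == (lang_x, lang_y):
            yield row["text_a"], row["text_b"]
        elif stored == (lang_y, lang_x):
            # Assumption: pairs may be stored in the reverse order, so swap sides.
            yield row["text_b"], row["text_a"]


# Toy rows following the schema; the ids and texts are placeholders.
rows = [
    {"id": "x1", "source": "PROIEL", "language_a": "got", "language_b": "grc",
     "text_a": "gothic text 1", "text_b": "greek text 1"},
    {"id": "x2", "source": "PROIEL", "language_a": "grc", "language_b": "got",
     "text_a": "greek text 2", "text_b": "gothic text 2"},
    {"id": "x3", "source": "OEDT", "language_a": "ang", "language_b": "eng",
     "text_a": "old english text", "text_b": "modern english text"},
]

got_grc = list(select_direction(rows, "got", "grc"))
```

Normalising the orientation this way keeps downstream code from having to check both column orders for every pair.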
Provider: sol-r



