
sol-r/ancient-aligned

Hugging Face | Updated 2026-03-22 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/sol-r/ancient-aligned
Description:
---
language:
- grc
- la
- he
- en
license: cc-by-sa-4.0
task_categories:
- sentence-similarity
- translation
- text-retrieval
tags:
- ancient-greek
- latin
- hebrew
- classical-languages
- cross-lingual
- bilingual-pairs
- embeddings
- perseus
- bible
size_categories:
- 10K<n<100K
---

# ancient-aligned

bilingual aligned pairs between ancient languages (greek, latin, hebrew) and english, drawn from classical literature and biblical texts.

## overview

| | count |
|---|---:|
| total pairs | 70,381 |
| greek-english | 27,421 |
| latin-english | 28,248 |
| hebrew-english | 14,712 |
| perseus works | 168 |
| perseus authors | 28 |
| bible books | 39 |

## schema

| column | type | description |
|---|---|---|
| `id` | string | unique identifier (e.g. `perseus:homer:1.1.1`, `bible:greek:Gen.1.1`) |
| `source` | string | `perseus` or `bible` |
| `language` | string | `greek`, `latin`, or `hebrew` |
| `original` | string | ancient language text |
| `english` | string | english translation |
| `author` | string | author name (perseus only) |
| `work` | string | work title or bible book abbreviation |
| `ref` | string | section/verse reference |
| `urn` | string | canonical identifier: CTS URN for perseus (e.g. `urn:cts:greekLit:tlg0003.tlg001:1.1.1`), OSIS ref for bible (e.g. `osis:Gen.1.1`) |

## sources

### perseus digital library (16,004 pairs)

aligned passages from 168 works by 28 authors in the [perseus digital library](http://www.perseus.tufts.edu/), covering classical greek and latin literature. authors include homer, thucydides, plutarch, cicero, livy, ovid, tacitus, aeschylus, pindar, hippocrates, lucian, julius caesar, pliny the elder, and others. texts range from 5th century BCE (thucydides, aeschylus) through late antiquity (augustine, bede).
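the `urn` column in the schema mixes two identifier styles, so downstream code may want to split it before grouping by work or book. the following is a minimal sketch, assuming every value follows one of the two example shapes shown in the schema table; `ParsedUrn` and `parse_urn` are hypothetical helper names, not part of the dataset or of any library.

```python
from typing import NamedTuple


class ParsedUrn(NamedTuple):
    scheme: str   # "cts" or "osis"
    work: str     # CTS namespace plus work id, or OSIS book abbreviation
    passage: str  # passage reference, e.g. "1.1.1" or "1.1"


def parse_urn(urn: str) -> ParsedUrn:
    """Split a `urn` value into scheme, work, and passage.

    Assumes the two shapes given as examples in the schema table:
    `urn:cts:<namespace>:<work>:<passage>` and `osis:<book>.<chapter>.<verse>`.
    """
    if urn.startswith("urn:cts:"):
        # Exactly five colon-separated fields are expected; anything else
        # raises ValueError from the unpacking below.
        _, _, namespace, work, passage = urn.split(":", 4)
        return ParsedUrn("cts", f"{namespace}:{work}", passage)
    if urn.startswith("osis:"):
        # The book abbreviation is everything before the first dot.
        book, _, passage = urn[len("osis:"):].partition(".")
        return ParsedUrn("osis", book, passage)
    raise ValueError(f"unrecognized urn format: {urn!r}")


cts = parse_urn("urn:cts:greekLit:tlg0003.tlg001:1.1.1")
osis = parse_urn("osis:Gen.1.1")
```

the returned `work` field can then serve as a grouping key, for instance to hold out whole works when building an evaluation split.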
### bible (54,377 pairs)

verse-level alignments from:

- **greek**: SBLGNT (new testament) + swete's septuagint (old testament)
- **latin**: clementine vulgate
- **hebrew**: leningrad codex (old testament only, 14,712 verses)
- **english**: king james version

covers 39 books across old and new testaments. each verse produces separate greek-english, latin-english, and (where available) hebrew-english pairs.

## preprocessing

- pairs with either side under 20 characters were filtered (47 rows removed: roman numeral headings, single-word fragments)
- no deduplication applied (some bible verses appear in multiple books as cross-references)
- perseus texts preserve original polytonic greek diacriticals
- bible greek preserves editorial marks from SBLGNT

## intended use

- training and evaluating cross-lingual embedding models
- ancient language retrieval and search
- machine translation for classical languages
- cross-lingual alignment research

## length distribution (characters, original side)

| source | p10 | p25 | p50 | p75 | p90 |
|---|---:|---:|---:|---:|---:|
| bible:greek | 59 | 75 | 107 | 146 | 185 |
| bible:latin | 59 | 74 | 101 | 138 | 172 |
| bible:hebrew | 57 | 69 | 103 | 143 | 181 |
| perseus:greek | 154 | 225 | 363 | 693 | 1123 |
| perseus:latin | 133 | 179 | 225 | 299 | 870 |

## license

CC BY-SA 4.0. perseus texts are distributed under the same license by the perseus digital library. bible source texts are public domain.

## citation

if you use this dataset, please cite the perseus digital library:

```
@misc{perseus,
  title={Perseus Digital Library},
  author={Crane, Gregory R.},
  url={http://www.perseus.tufts.edu/},
  year={2024}
}
```
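the length filter described under preprocessing, combined with a per-language subset, can be sketched as follows. this is an illustration only, not the author's actual pipeline: it assumes "characters" means python `len()` on the raw string with no stripping or normalization, and `keep_pair`, `filter_pairs`, and the sample rows are hypothetical and not taken from the dataset.

```python
MIN_CHARS = 20  # threshold stated in the preprocessing notes


def keep_pair(row, min_chars=MIN_CHARS):
    """Return True when both sides of a pair have at least `min_chars` characters."""
    return len(row["original"]) >= min_chars and len(row["english"]) >= min_chars


def filter_pairs(rows, language=None, min_chars=MIN_CHARS):
    """Keep rows that pass the length check, optionally restricted to one language."""
    return [
        row
        for row in rows
        if keep_pair(row, min_chars)
        and (language is None or row["language"] == language)
    ]


# Made-up rows for illustration; ids and texts are not drawn from the dataset.
sample_rows = [
    {"id": "example:1", "language": "latin",
     "original": "XIV",
     "english": "Chapter fourteen of the book"},
    {"id": "example:2", "language": "latin",
     "original": "Gallia est omnis divisa in partes tres",
     "english": "All Gaul is divided into three parts"},
    {"id": "example:3", "language": "greek",
     "original": "ἐν ἀρχῇ ἦν ὁ λόγος, καὶ ὁ λόγος ἦν πρὸς τὸν θεόν",
     "english": "In the beginning was the Word, and the Word was with God"},
]

kept = filter_pairs(sample_rows)                       # drops the heading-like row
latin_only = filter_pairs(sample_rows, language="latin")
```

since the published data is already filtered at 20 characters, the same predicate is mainly useful with a higher `min_chars`, for example to exclude very short bible verses when training on longer passages.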
Provider:
sol-r