sol-r/keilschrift-corpus
Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/sol-r/keilschrift-corpus
Resource description:
# Keilschrift: Ancient Near Eastern Cuneiform Corpus
A unified multilingual corpus of ancient Near Eastern cuneiform texts for
machine learning, covering Sumerian, Akkadian, Old Persian, Elamite, Hittite,
Hurrian, Urartian, and Aramaic — spanning 3400 BCE to 63 BCE.
"Keilschrift" is German for "cuneiform" (literally "wedge-writing").
## Statistics
| metric | count |
|---|---|
| **corpus passages** | 214,049 |
| **parallel pairs** | 170,199 |
| **languages** | 8 |
| **time span** | 3400 BCE – 63 BCE |
### Corpus by language
| ISO 639-3 | language | family | count | period |
|---|---|---|---|---|
| `sux` | Sumerian | isolate | 182,621 | 3400 BCE – 100 CE |
| `akk` | Akkadian | Semitic (East) | 21,929 | 2500 BCE – 100 CE |
| `peo` | Old Persian | Indo-European (Iranian) | 8,045 | 525 – 330 BCE |
| `elx` | Elamite | isolate | 1,436 | 2600 – 360 BCE |
| `xhu` | Hurrian | Hurro-Urartian | 12 | 2300 – 1200 BCE |
| `hit` | Hittite | Indo-European (Anatolian) | 2 | 1600 – 1178 BCE |
| `xur` | Urartian | Hurro-Urartian | 2 | 860 – 590 BCE |
| `arc` | Aramaic | Semitic (NW) | 2 | 1100 BCE – present |
### Corpus by source
| source | passages |
|---|---|
| CDLI | 131,316 |
| SumTablets | 82,339 |
| ETCSL | 394 |
### Pairs by type
| direction | type | count |
|---|---|---|
| sux → sux | cuneiform unicode → transliteration | 82,339 |
| sux → sux | sign names → transliteration | 82,339 |
| sux → eng | transliteration → translation | 4,455 |
| akk → eng | transliteration → translation | 961 |
| elx → eng | transliteration → translation | 72 |
| peo → eng | transliteration → translation | 29 |
| hit → eng | transliteration → translation | 2 |
| xur → eng | transliteration → translation | 2 |
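The three breakdowns above can be cross-checked against the headline totals with a few lines of arithmetic. The sketch below only re-adds the counts copied from the tables; the variable names are illustrative and are not part of the dataset.

```python
# Counts copied from the "Corpus by language", "Corpus by source",
# and "Pairs by type" tables above.
by_language = {
    "sux": 182_621, "akk": 21_929, "peo": 8_045, "elx": 1_436,
    "xhu": 12, "hit": 2, "xur": 2, "arc": 2,
}
by_source = {"CDLI": 131_316, "SumTablets": 82_339, "ETCSL": 394}
pairs_by_type = {
    ("sux", "sux", "cuneiform unicode -> transliteration"): 82_339,
    ("sux", "sux", "sign names -> transliteration"): 82_339,
    ("sux", "eng", "transliteration -> translation"): 4_455,
    ("akk", "eng", "transliteration -> translation"): 961,
    ("elx", "eng", "transliteration -> translation"): 72,
    ("peo", "eng", "transliteration -> translation"): 29,
    ("hit", "eng", "transliteration -> translation"): 2,
    ("xur", "eng", "transliteration -> translation"): 2,
}

total_passages = sum(by_language.values())
total_by_source = sum(by_source.values())
total_pairs = sum(pairs_by_type.values())

# Pairs whose target side is an English translation.
english_pairs = sum(n for (src, tgt, _), n in pairs_by_type.items() if tgt == "eng")

# Both passage breakdowns should match the headline figure, and the
# pair types should add up to the headline pair count.
assert total_passages == total_by_source == 214_049
assert total_pairs == 170_199

print(f"passages: {total_passages:,}")
print(f"pairs: {total_pairs:,} (of which English translations: {english_pairs:,})")
```

The totals agree with the Statistics table. The same arithmetic shows how skewed the release is: 5,521 of the 170,199 pairs are translation pairs, and the remaining 164,678 are Sumerian script-to-transliteration pairs.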
## Sources and citations
### SumTablets
82,339 Sumerian cuneiform tablets with three-way alignment: Unicode cuneiform
glyphs, sign name sequences, and scholarly transliterations. Derived from ORACC
transliterations mapped to Unicode representations.
- **Citation:** Simmons, C. (2024). "SumTablets: A Transliteration Dataset of
Sumerian Tablets." *Proceedings of the 1st Workshop on Machine Learning for
Ancient Languages (ML4AL 2024)*, ACL Anthology.
doi:[10.18653/v1/2024.ml4al-1.20](https://aclanthology.org/2024.ml4al-1.20/)
- **License:** CC BY 4.0
- **URL:** https://huggingface.co/datasets/colesimmons/SumTablets
### ETCSL (Electronic Text Corpus of Sumerian Literature)
394 Sumerian literary compositions with transliterations and 380 English prose
translations. Includes hymns, myths, epics (Gilgamesh, Enmerkar), laments,
proverbs, debates, and royal hymns.
- **Citation:** Black, J.A., Cunningham, G., Fluckiger-Hawker, E., Robson, E.,
and Zólyomi, G. (1998–2006). *The Electronic Text Corpus of Sumerian
Literature.* Oxford: Faculty of Oriental Studies, University of Oxford.
- **License:** CC BY-NC-SA 3.0
- **URL:** https://etcsl.orinst.ox.ac.uk/
- **Archive:** Oxford Text Archive, handle
[20.500.12024/2518](https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2518)
### CDLI (Cuneiform Digital Library Initiative)
131,316 cuneiform texts with ATF transliterations drawn from a catalogue of
353,000+ artifacts. Covers Sumerian, Akkadian, Old Persian, Elamite, Hittite,
Hurrian, Urartian, and Aramaic. 12,559 texts include English translations.
- **Citation:** Englund, R.K. et al. *Cuneiform Digital Library Initiative.*
University of California, Los Angeles / Max Planck Institute for the History
of Science, Berlin.
- **License:** CC BY 4.0
- **URL:** https://cdli.earth/
- **Data:** https://github.com/cdli-gh/data
### ORACC (Open Richly Annotated Cuneiform Corpus)
Not yet included in this release. ORACC provides curated, lemmatized cuneiform
texts with scholarly translations from 140+ projects covering Sumerian,
Akkadian, Hittite, and other languages.
- **Citation:** Tinney, S. et al. *Open Richly Annotated Cuneiform Corpus.*
University of Pennsylvania Museum of Archaeology and Anthropology.
- **License:** CC BY-SA 3.0
- **URL:** https://oracc.museum.upenn.edu/
## Schema
**Corpus (monolingual passages):**
| field | type | description |
|---|---|---|
| `id` | string | unique identifier (`{source}_{original_id}`) |
| `text` | string | text content (transliteration) |
| `text_type` | string | `transliteration` |
| `language` | string | ISO 639-3 code |
| `source` | string | one of `etcsl`, `cdli`, `sumtablets` |
| `period` | string | historical period (e.g. "Ur III (ca. 2100-2000 BC)") |
| `genre` | string | text genre (e.g. "literary", "administrative") |
| `cdli_id` | string | CDLI P-number if available |
| `work` | string | composition name for literary texts |
| `word_count` | int | word/token count |
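As a minimal sketch, the corpus schema above can be written down as a typed record, with a helper that splits the `id` field back into its `{source}_{original_id}` parts. The names `CorpusRecord`, `KNOWN_SOURCES`, and `split_id` are hypothetical, the sample record is invented for illustration, and the code assumes the `id` prefix uses the same lowercase spelling as the `source` field.

```python
from typing import TypedDict

# Source names as listed in the schema table (assumed to prefix the id).
KNOWN_SOURCES = ("etcsl", "cdli", "sumtablets")


class CorpusRecord(TypedDict):
    """One monolingual passage, mirroring the corpus schema table."""
    id: str
    text: str
    text_type: str
    language: str
    source: str
    period: str
    genre: str
    cdli_id: str
    work: str
    word_count: int


def split_id(record_id: str) -> tuple[str, str]:
    """Split '{source}_{original_id}' on the first underscore."""
    source, sep, original_id = record_id.partition("_")
    if not sep or source not in KNOWN_SOURCES:
        raise ValueError(f"unexpected id format: {record_id!r}")
    return source, original_id


# Hypothetical record: every value here is a placeholder, not real corpus data.
example: CorpusRecord = {
    "id": "cdli_P100001",
    "text": "1(disz) udu",
    "text_type": "transliteration",
    "language": "sux",
    "source": "cdli",
    "period": "Ur III (ca. 2100-2000 BC)",
    "genre": "administrative",
    "cdli_id": "P100001",
    "work": "",
    "word_count": 2,
}

source, original_id = split_id(example["id"])
print(source, original_id)
```

Splitting on the first underscore only is deliberate: none of the three source names contains an underscore, while an original identifier might.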
**Pairs (parallel texts):**
| field | type | description |
|---|---|---|
| `id` | string | unique identifier |
| `source` | string | data source |
| `language_a` | string | source language ISO 639-3 |
| `language_b` | string | target language ISO 639-3 |
| `text_a` | string | source text |
| `text_b` | string | target text |
| `text_type_a` | string | `cuneiform_unicode`, `sign_names`, or `transliteration` |
| `text_type_b` | string | `transliteration` or `translation` |
| `period` | string | historical period |
| `genre` | string | text genre |
| `cdli_id` | string | CDLI P-number |
| `work` | string | composition name |
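For translation experiments, only the pairs whose target side is an English translation are relevant. The sketch below filters a list of pair records on the `text_type_a`, `text_type_b`, `language_a`, and `language_b` fields from the table above. It works on plain dictionaries, so it makes no assumption about how the dataset is loaded; the function name `translation_pairs` and the sample records are hypothetical.

```python
from typing import Iterable, Iterator, Optional


def translation_pairs(
    pairs: Iterable[dict], source_lang: Optional[str] = None
) -> Iterator[tuple[str, str]]:
    """Yield (transliteration, English translation) text tuples.

    If source_lang is given (an ISO 639-3 code such as "sux"),
    only pairs with that source language are kept.
    """
    for pair in pairs:
        if pair["text_type_a"] != "transliteration":
            continue
        if pair["text_type_b"] != "translation" or pair["language_b"] != "eng":
            continue
        if source_lang is not None and pair["language_a"] != source_lang:
            continue
        yield pair["text_a"], pair["text_b"]


# Hypothetical records: placeholder texts, not actual corpus content.
sample_pairs = [
    {"language_a": "sux", "language_b": "eng",
     "text_a": "lugal-e e2 mu-un-du3", "text_b": "The king built the house.",
     "text_type_a": "transliteration", "text_type_b": "translation"},
    {"language_a": "sux", "language_b": "sux",
     "text_a": "LUGAL E E2 MU UN DU3", "text_b": "lugal-e e2 mu-un-du3",
     "text_type_a": "sign_names", "text_type_b": "transliteration"},
    {"language_a": "akk", "language_b": "eng",
     "text_a": "placeholder transliteration", "text_b": "placeholder translation",
     "text_type_a": "transliteration", "text_type_b": "translation"},
]

for src_text, tgt_text in translation_pairs(sample_pairs, source_lang="sux"):
    print(src_text, "=>", tgt_text)
```

Calling `translation_pairs(sample_pairs)` without `source_lang` keeps every language that has English translations, which may be useful given how few non-Sumerian translation pairs the release contains.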
## Related projects
- [CuneiML](https://github.com/taineleau/CuneiML) — cuneiform dataset with
tablet photographs and Unicode transcriptions (Liang et al., 2023).
doi:[10.5334/johd.151](https://openhumanitiesdata.metajnl.com/articles/10.5334/johd.151)
- [CompVis cuneiform-sign-detection](https://github.com/CompVis/cuneiform-sign-detection-dataset) —
image-based cuneiform sign detection dataset
## Acknowledgments
This dataset builds on decades of work by Assyriologists, Sumerologists, and
digital humanities scholars who created CDLI, ORACC, and ETCSL. Special thanks
to Cole Simmons and the SumTablets project for making cuneiform Unicode mappings
available in ML-ready format.
Provider: sol-r



