sol-r/keilschrift-corpus

Name: sol-r/keilschrift-corpus
Creator: sol-r
Published: 2026-03-24 02:45:46
License: not specified

Hugging Face · updated 2026-03-24 · indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/sol-r/keilschrift-corpus

Description:

# Keilschrift: Ancient Near Eastern Cuneiform Corpus

A unified multilingual corpus of ancient Near Eastern cuneiform texts for machine learning, covering Sumerian, Akkadian, Old Persian, Elamite, Hittite, Hurrian, Urartian, and Aramaic, spanning 3400 BCE to 63 BCE. "Keilschrift" is German for "cuneiform" (literally "wedge-writing").

## Statistics

| metric | count |
|---|---|
| **corpus passages** | 214,049 |
| **parallel pairs** | 170,199 |
| **languages** | 8 |
| **time span** | 3400 BCE – 63 BCE |

### Corpus by language

| ISO 639-3 | language | family | count | period |
|---|---|---|---|---|
| `sux` | Sumerian | isolate | 182,621 | 3400 BCE – 100 CE |
| `akk` | Akkadian | Semitic (East) | 21,929 | 2500 BCE – 100 CE |
| `peo` | Old Persian | IE (Iranian) | 8,045 | 525 – 330 BCE |
| `elx` | Elamite | isolate | 1,436 | 2600 – 360 BCE |
| `xhu` | Hurrian | Hurro-Urartian | 12 | 2300 – 1200 BCE |
| `hit` | Hittite | IE (Anatolian) | 2 | 1600 – 1178 BCE |
| `xur` | Urartian | Hurro-Urartian | 2 | 860 – 590 BCE |
| `arc` | Aramaic | Semitic (NW) | 2 | 1100 BCE – present |

### Corpus by source

| source | passages |
|---|---|
| CDLI | 131,316 |
| SumTablets | 82,339 |
| ETCSL | 394 |

### Pairs by type

| direction | type | count |
|---|---|---|
| sux → sux | cuneiform unicode → transliteration | 82,339 |
| sux → sux | sign names → transliteration | 82,339 |
| sux → eng | transliteration → translation | 4,455 |
| akk → eng | transliteration → translation | 961 |
| elx → eng | transliteration → translation | 72 |
| peo → eng | transliteration → translation | 29 |
| hit → eng | transliteration → translation | 2 |
| xur → eng | transliteration → translation | 2 |

## Sources and citations

### SumTablets

82,339 Sumerian cuneiform tablets with three-way alignment: Unicode cuneiform glyphs, sign name sequences, and scholarly transliterations. Derived from ORACC transliterations mapped to Unicode representations.

- **Citation:** Simmons, C. (2024). "SumTablets: A Transliteration Dataset of Sumerian Tablets." *Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)*, ACL Anthology. doi:[10.18653/v1/2024.ml4al-1.20](https://aclanthology.org/2024.ml4al-1.20/)
- **License:** CC BY 4.0
- **URL:** https://huggingface.co/datasets/colesimmons/SumTablets

### ETCSL (Electronic Text Corpus of Sumerian Literature)

394 Sumerian literary compositions with transliterations and 380 English prose translations. Includes hymns, myths, epics (Gilgamesh, Enmerkar), laments, proverbs, debates, and royal hymns.

- **Citation:** Black, J.A., Cunningham, G., Fluckiger-Hawker, E., Robson, E., and Zólyomi, G. (1998–2006). *The Electronic Text Corpus of Sumerian Literature.* Oxford: Faculty of Oriental Studies, University of Oxford.
- **License:** CC BY-NC-SA 3.0
- **URL:** https://etcsl.orinst.ox.ac.uk/
- **Archive:** Oxford Text Archive, handle [20.500.12024/2518](https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2518)

### CDLI (Cuneiform Digital Library Initiative)

131,316 cuneiform texts with ATF transliterations drawn from a catalogue of 353,000+ artifacts. Covers Sumerian, Akkadian, Old Persian, Elamite, Hittite, Hurrian, Urartian, and Aramaic. 12,559 texts include English translations.

- **Citation:** Englund, R.K. et al. *Cuneiform Digital Library Initiative.* University of California, Los Angeles / Max Planck Institute for the History of Science, Berlin.
- **License:** CC BY 4.0
- **URL:** https://cdli.earth/
- **Data:** https://github.com/cdli-gh/data

### ORACC (Open Richly Annotated Cuneiform Corpus)

Not yet included in this release. ORACC provides curated, lemmatized cuneiform texts with scholarly translations from 140+ projects covering Sumerian, Akkadian, Hittite, and other languages.

- **Citation:** Tinney, S. et al. *Open Richly Annotated Cuneiform Corpus.* University of Pennsylvania Museum of Archaeology and Anthropology.
- **License:** CC BY-SA 3.0
- **URL:** https://oracc.museum.upenn.edu/

## Schema

**Corpus (monolingual passages):**

| field | type | description |
|---|---|---|
| `id` | string | unique identifier (`{source}_{original_id}`) |
| `text` | string | text content (transliteration) |
| `text_type` | string | `transliteration` |
| `language` | string | ISO 639-3 code |
| `source` | string | `etcsl`, `cdli`, `sumtablets` |
| `period` | string | historical period (e.g. "Ur III (ca. 2100-2000 BC)") |
| `genre` | string | text genre (e.g. "literary", "administrative") |
| `cdli_id` | string | CDLI P-number if available |
| `work` | string | composition name for literary texts |
| `word_count` | int | word/token count |

**Pairs (parallel texts):**

| field | type | description |
|---|---|---|
| `id` | string | unique identifier |
| `source` | string | data source |
| `language_a` | string | source language ISO 639-3 |
| `language_b` | string | target language ISO 639-3 |
| `text_a` | string | source text |
| `text_b` | string | target text |
| `text_type_a` | string | `cuneiform_unicode`, `sign_names`, or `transliteration` |
| `text_type_b` | string | `transliteration` or `translation` |
| `period` | string | historical period |
| `genre` | string | text genre |
| `cdli_id` | string | CDLI P-number |
| `work` | string | composition name |

## Related projects

- [CuneiML](https://github.com/taineleau/CuneiML): cuneiform dataset with tablet photographs and Unicode transcriptions (Liang et al., 2023). doi:[10.5334/johd.151](https://openhumanitiesdata.metajnl.com/articles/10.5334/johd.151)
- [CompVis cuneiform-sign-detection](https://github.com/CompVis/cuneiform-sign-detection-dataset): image-based cuneiform sign detection dataset

## Acknowledgments

This dataset builds on decades of work by assyriologists, sumerologists, and digital humanities scholars who created CDLI, ORACC, and ETCSL. Special thanks to Cole Simmons and the SumTablets project for making cuneiform Unicode mappings available in ML-ready format.
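
## Example: selecting pairs by direction

The "Pairs" schema and the "Pairs by type" table describe each record by its language codes and text types. The sketch below shows one way such records might be filtered and tallied in plain Python. It is illustrative only: the function names `select_pairs` and `count_directions` are hypothetical, the sample records are placeholders rather than corpus data, and the card does not document configuration or split names, so no loading call is shown.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

# Hypothetical records shaped like the "Pairs" schema.
# The text values are placeholders, NOT real corpus content.
SAMPLE_PAIRS: List[Dict[str, str]] = [
    {
        "id": "example_1", "source": "cdli",
        "language_a": "sux", "language_b": "eng",
        "text_a": "<Sumerian transliteration>",
        "text_b": "<English translation>",
        "text_type_a": "transliteration", "text_type_b": "translation",
        "period": "", "genre": "", "cdli_id": "", "work": "",
    },
    {
        "id": "example_2", "source": "sumtablets",
        "language_a": "sux", "language_b": "sux",
        "text_a": "<cuneiform Unicode glyphs>",
        "text_b": "<Sumerian transliteration>",
        "text_type_a": "cuneiform_unicode", "text_type_b": "transliteration",
        "period": "", "genre": "", "cdli_id": "", "work": "",
    },
    {
        "id": "example_3", "source": "cdli",
        "language_a": "akk", "language_b": "eng",
        "text_a": "<Akkadian transliteration>",
        "text_b": "<English translation>",
        "text_type_a": "transliteration", "text_type_b": "translation",
        "period": "", "genre": "", "cdli_id": "", "work": "",
    },
]


def select_pairs(
    pairs: Iterable[Dict[str, str]],
    language_a: str,
    language_b: str,
    text_type_a: str = "transliteration",
    text_type_b: str = "translation",
) -> Iterator[Dict[str, str]]:
    """Yield records whose languages and text types match the requested direction."""
    for row in pairs:
        if (
            row["language_a"] == language_a
            and row["language_b"] == language_b
            and row["text_type_a"] == text_type_a
            and row["text_type_b"] == text_type_b
        ):
            yield row


def count_directions(pairs: Iterable[Dict[str, str]]) -> Counter:
    """Tally records per (language_a, language_b, text_type_a, text_type_b)."""
    return Counter(
        (row["language_a"], row["language_b"], row["text_type_a"], row["text_type_b"])
        for row in pairs
    )


# Sumerian transliteration -> English translation pairs only.
translation_pairs = list(select_pairs(SAMPLE_PAIRS, "sux", "eng"))
direction_counts = count_directions(SAMPLE_PAIRS)
```

Rows loaded with the Hugging Face `datasets` library iterate as dictionaries, so the same functions should apply to the real data, provided the field names match the schema above. Running `count_directions` over the full pairs set would be one way to check a local copy against the "Pairs by type" table.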

Provider:

sol-r
