HeiCuBeDa Hilprecht - Heidelberg Cuneiform Benchmark Dataset for the Hilprecht Collection

Name: HeiCuBeDa Hilprecht - Heidelberg Cuneiform Benchmark Dataset for the Hilprecht Collection
Creator: heiDATA
Published: 2025-01-28 12:48:53
License: No description available

DataCite Commons. Updated 2025-01-28; indexed 2025-04-17.

Download link:

https://heidata.uni-heidelberg.de/citation?persistentId=doi:10.11588/DATA/IE8CCN

Description:

The number of known cuneiform tablets is assumed to be in the hundreds of thousands. A fraction of them has been published as printed photographs and manual tracings in books. These publications are collected in the online Cuneiform Digital Library Initiative (CDLI) catalog, which includes some of the images and provides metadata for more than 100,000 tablets. While 3D acquisition is the most modern way to document tablets, the number of 3D datasets is relatively small and they are often not openly accessible. However, the Hilprecht Archive Online (HAO) provides 1977 high-resolution 3D scans of tablets under an Open Access license. Although both the HAO and the CDLI are publicly accessible, large-scale machine learning and pattern recognition on cuneiform tablets remains elusive, because the data can only be reached by navigating web pages, the tablet identifiers are inconsistent between collections, and the 3D data is unprepared and challenging for automated processing. We enable large-scale analysis of cuneiform tablets with HeiCuBeDa Hilprecht, a cross-referenced benchmark dataset of processed cuneiform tablets: (i) frontally aligned 3D tablets with pre-computed high-dimensional surface features, (ii) six-view raster images for off-the-shelf image processing, and (iii) metadata, transcriptions, and transliterations for a subset of 707 tablets, intended for learning the alignment between 3D data, images, and linguistic expression. This is the first dataset of its kind, and of its size, in cuneiform research. The benchmark dataset is prepared for ease of use and immediate availability for computational research, lowering the barrier to experimenting with and applying standard methods of analysis. A Python script is provided to retrieve and compute an updated JSON database of the CDLI's metadata and raster images.
Up-to-date code and metadata are also available at https://gitlab.com/fcgl/releases/-/tree/master/mara_icdar_2019.
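The cross-referencing between collections described above can be illustrated with a minimal sketch. It assumes a JSON list of per-tablet records; the field names (`hao_id`, `cdli_id`, `transliteration`), the identifier values, and both helper functions are hypothetical and invented for illustration. They are not taken from the dataset's actual schema or from the provided script.

```python
import json

# Hypothetical records: the field names ("hao_id", "cdli_id", "transliteration")
# and all values are invented for illustration; they are NOT the dataset's schema.
SAMPLE_JSON = """
[
  {"hao_id": "HS_0001", "cdli_id": "P000001", "transliteration": "example text"},
  {"hao_id": "HS_0002", "cdli_id": "P000002", "transliteration": ""},
  {"hao_id": "HS_0003", "cdli_id": null, "transliteration": null}
]
"""


def build_cross_reference(records):
    """Map each 3D-scan identifier to its catalog identifier, skipping unmatched records."""
    return {r["hao_id"]: r["cdli_id"] for r in records if r.get("cdli_id")}


def with_transliteration(records):
    """Return the scan identifiers of records that carry a non-empty transliteration."""
    return [r["hao_id"] for r in records if r.get("transliteration")]


records = json.loads(SAMPLE_JSON)
xref = build_cross_reference(records)
aligned = with_transliteration(records)
print(xref)     # scan identifier -> catalog identifier
print(aligned)  # tablets usable for aligning 3D data, images, and text
```

A lookup table of this kind is one simple way to bridge inconsistent identifiers before joining 3D data, raster images, and text for the same tablet.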

Provider: heiDATA

Created: 2019-06-06
