HeiCuBeDa Hilprecht - Heidelberg Cuneiform Benchmark Dataset for the Hilprecht Collection

Mendeley Data. Updated 2024-03-27; indexed 2024-06-29.

Download link:

https://heidata.uni-heidelberg.de/citation?persistentId=doi:10.11588/data/IE8CCN

Description:

The number of known cuneiform tablets is assumed to be in the hundreds of thousands. A fraction of them has been published as printed photographs and manual tracings in books. These publications are collected in the online Cuneiform Digital Library Initiative (CDLI) catalog, which includes some of these images and provides metadata for more than 100,000 tablets. Although 3D acquisition is the most modern way to document tablets, the number of 3D datasets is relatively small and they are often not openly accessible. The Hilprecht Archive Online (HAO), however, provides 1977 high-resolution 3D scans of tablets under an Open Access license. While both the HAO and the CDLI are publicly accessible, large-scale machine learning and pattern recognition on cuneiform tablets remain elusive for three reasons: the data can only be reached by navigating web pages, the tablet identifiers are inconsistent between collections, and the 3D data is unprepared and challenging for automated processing. HeiCuBeDa Hilprecht enables large-scale analysis of cuneiform tablets. It is a cross-referenced benchmark dataset of processed cuneiform tablets comprising (i) frontally aligned 3D tablets with pre-computed high-dimensional surface features, (ii) six-view raster images for off-the-shelf image processing, and (iii) metadata, transcriptions, and transliterations, for a subset of 707 tablets, for learning alignment between 3D data, images, and linguistic expression. This is the first dataset of its kind, and of its size, in cuneiform research. The benchmark dataset is prepared for ease of use and immediate availability for computational research, lowering the barrier to experimenting with and applying standard methods of analysis. A Python script is provided to retrieve and compute an updated JSON database of the CDLI's metadata and raster images.
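The cross-referencing problem described above (inconsistent tablet identifiers between collections) can be illustrated with a minimal sketch. This is not the script shipped with the dataset: the field names `tablet_id` and `museum_no`, the file names, and the toy records are all hypothetical and exist only to show the idea of matching records on a normalized identifier.

```python
import json
import re


def normalize_id(raw: str) -> str:
    """Lowercase and drop everything except letters and digits,
    so that 'HS 1174' and 'hs-1174' compare equal."""
    return re.sub(r"[^a-z0-9]", "", raw.lower())


def cross_reference(scans, catalog):
    """Attach catalog metadata to each scan whose normalized ID matches."""
    # Hypothetical field name: 'museum_no' on catalog entries.
    index = {normalize_id(entry["museum_no"]): entry for entry in catalog}
    linked = []
    for scan in scans:
        # Hypothetical field name: 'tablet_id' on scan records.
        meta = index.get(normalize_id(scan["tablet_id"]))
        if meta is not None:
            linked.append({**scan, "catalog": meta})
    return linked


# Toy records, invented for illustration; not taken from HAO or the CDLI.
scans = [
    {"tablet_id": "HS 1174", "mesh": "example_a.ply"},
    {"tablet_id": "HS 0042", "mesh": "example_b.ply"},
]
catalog = [{"museum_no": "hs-1174", "note": "placeholder metadata"}]

linked = cross_reference(scans, catalog)
print(json.dumps(linked, indent=2))
```

Only the first toy scan is linked, because the second has no catalog counterpart. Real identifiers may need more careful handling (leading zeros, suffixes, joins of fragments) than this simple normalization provides.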

Created:

2023-09-12
