HeiCuBeDa Hilprecht - Heidelberg Cuneiform Benchmark Dataset for the Hilprecht Collection

Mendeley Data | Updated 2024-03-27 | Indexed 2024-06-29

Download link:

https://heidata.uni-heidelberg.de/citation?persistentId=doi:10.11588/data/IE8CCN

Description:

The number of known cuneiform tablets is assumed to be in the hundreds of thousands. A fraction has been published as printed photographs and manual tracings in books. The online Cuneiform Digital Library Initiative (CDLI) catalog collects these publications, includes some of the images, and provides metadata for more than 100,000 tablets. While 3D acquisition is the most modern way to document tablets, the number of 3D datasets is relatively small, and they are often not openly accessible. However, the Hilprecht Archive Online (HAO) provides 1977 high-resolution 3D scans of tablets under an Open Access license. Although both the HAO and the CDLI are publicly accessible, large-scale machine learning and pattern recognition on cuneiform tablets remain elusive, because the data can only be reached by navigating web pages, the tablet identifiers are inconsistent between collections, and the 3D data is unprepared and challenging for automated processing. We enable large-scale analysis of cuneiform tablets with this HeiCuBeDa Hilprecht assembly, a cross-referenced benchmark dataset of processed cuneiform tablets: (i) frontally aligned 3D tablets with pre-computed high-dimensional surface features, (ii) six-view raster images for off-the-shelf image processing, and (iii) metadata, transcriptions, and transliterations for a subset of 707 tablets, for learning alignments between 3D data, images, and linguistic expression. This is the first dataset of its kind, and of its size, in cuneiform research. The benchmark dataset is prepared for ease of use and immediate availability for computational research, lowering the barrier to experimenting with and applying standard methods of analysis. A Python script is provided to retrieve and compute an updated JSON database of the CDLI's metadata and raster images. Up-to-date code and metadata are also available at https://gitlab.com/fcgl/releases/-/tree/master/mara_icdar_2019.
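The description names inconsistent tablet identifiers between collections as one obstacle to automated processing. The following is a minimal sketch of how such a cross-reference could be built. It is not the script shipped with the dataset. The field names (`tablet`, `mesh`, `designation`, `cdli_no`), the example identifier spellings, and the normalization rule are all assumptions made for illustration and would need to be adapted to the actual files.

```python
import re


def normalize_id(raw: str) -> str:
    """Reduce spelling variants of an identifier to one canonical form.

    Assumed rule (hypothetical): letters, optional separator, digits,
    optional single-letter suffix, e.g. 'HS 0123' and 'hs_123' -> 'HS123'.
    """
    m = re.fullmatch(r"\s*([A-Za-z]+)[\s._-]*0*(\d+)([A-Za-z]?)\s*", raw)
    if m is None:
        # Unknown pattern: fall back to a trimmed, upper-cased string.
        return raw.strip().upper()
    prefix, number, suffix = m.groups()
    return f"{prefix.upper()}{number}{suffix.lower()}"


def cross_reference(scans, records):
    """Match 3D scan entries to metadata records by normalized identifier.

    Both arguments are lists of dicts with hypothetical keys:
    scans use 'tablet' and 'mesh', records use 'designation' and 'cdli_no'.
    Returns (matched, unmatched), where unmatched keeps the raw scan names.
    """
    by_id = {normalize_id(r["designation"]): r for r in records}
    matched, unmatched = [], []
    for scan in scans:
        key = normalize_id(scan["tablet"])
        record = by_id.get(key)
        if record is None:
            unmatched.append(scan["tablet"])
        else:
            matched.append(
                {"id": key, "mesh": scan["mesh"], "cdli_no": record["cdli_no"]}
            )
    return matched, unmatched
```

Keeping the unmatched names in a separate list makes it easy to review by hand the tablets for which no metadata record was found, rather than dropping them silently.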

Created:

2023-06-28
