HamdiJr/Egyptian_hieroglyphs

Name: HamdiJr/Egyptian_hieroglyphs
Creator: HamdiJr
Published: 2022-07-22 18:31:58
License: no description provided

Hugging Face · updated 2022-07-22 · indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/HamdiJr/Egyptian_hieroglyphs

Description:

# Egyptian hieroglyphs 𓂀

## _Hieroglyphs image dataset along with Language Model !_

![code](https://i.ibb.co/WtGgxkz/Screenshot-2022-07-12-214648-transformed.png)

## Features

- This dataset is built from the hieroglyphs found in 10 different pictures from the book "The Pyramid of Unas" (Alexandre Piankoff, 1955). We therefore urge you to have access to this book before using the dataset.
- The ten pictures used throughout this dataset are 3, 5, 7, 9, 20, 21, 22, 23, 39 and 41 (the numbers are those used in the book "The Pyramid of Unas").
- Each hieroglyph is manually annotated and labelled according to the Gardiner Sign List. The images are stored with their label and number in their name.

```sh
totalImages = 4210 (of which 179 are labelled as UNKNOWN)
totalClasses = 171 (excluding the UNKNOWN class)
```

> NOTE: The labelling may not be 100% correct.
> It reflects my own knowledge as an Egyptian.
> The hieroglyphs that I was unable to identify are labelled as "UNKNOWN".

## Process

Aside from the manual annotation, we used a text-detection method to extract the hieroglyphs automatically. The results are stored in `Dataset/Automated/`.

The labels on automatically detected images are based on a comparison with the manual annotation, and are assigned according to the Pascal VOC overlap criterion (50% overlap).

The x/y position of each hieroglyph is stored in the Locations folder. Each file in this folder contains the exact position of all (raw) annotated hieroglyphs in the corresponding picture.

Example: "030000_S29.png,71,27,105,104," from Dataset/Manual/Locations/3.txt:

- image = Dataset/Manual/Raw/3/030000_D35.png
- Picture number = 3 (Dataset/Pictures/egyptianTexts3.jpg)
- index number = 0
- Gardiner label = D35
- top-left position = 71,27
- bottom-right position = 105,104 (such that the width is (105-71) = 34, and the height is (104-27) = 77)

Included in this dataset are some tools to create the language model.

The folder `Dataset/LanguageModel/JSESH_EgyptianTexts/` contains the Egyptian texts from the JSesh database. JSesh is an open-source program used to write hieroglyphs: [JSesh](http://jsesh.qenherkhopeshef.org/). The texts are written in a mixture of Gardiner labels and transliteration. Each text can be opened in JSesh to view the hieroglyphs.

Furthermore, a lexicon is included in `Dataset/LanguageModel/Lexicon.txt`. It originally comes from [OpenGlyph](http://sourceforge.net/projects/openglyph/), with word-occurrence values added on the basis of the EgyptianTexts. Each time a word is encountered in the text, its word-occurrence is increased by 1 divided by the number of other possible words that can be made with the surrounding hieroglyphs.

The lexicon is organised as follows: each line contains a word made up of a number of hieroglyphs, together with its transliteration, translation and word-occurrence. Each element is separated by a semicolon.

`Example: D36,N35,D7,;an;beautiful;0.333333;`

- The 3 hieroglyphs used to write this word: D36,N35,D7,
- transliteration: an
- English translation: beautiful
- word-occurrence: 0.333333

nGrams are included in this dataset as well, under Dataset/LanguageModel/nGrams.txt. Each line in this file contains an nGram (uni-gram, bi-gram or tri-gram) accompanied by its occurrence.

`Example: G17,N29,G1,;9;`

- Hieroglyphs used to write this tri-gram: G17,N29,G1
- number of occurrences in the EgyptianTexts database: 9

## Structure

The dataset is organised as follows:

Dataset/
|---Pictures/     `Contains 10 pictures from the book "The Pyramid of Unas", which are used throughout this dataset`
|---Manual/     `Contains the manually annotated images of hieroglyphs`
|------Locations/     `Contains the location-files that hold the x/y position of each hieroglyph`
|------Preprocessed/     `Contains the pre-processed images`
|------Raw/     `Contains the raw, un-pre-processed, images of hieroglyphs`
|---Automated/     `Contains the result of the automatic hieroglyph detection`
|------Locations/     `Contains the location-files that hold the x/y position of each hieroglyph`
|------Preprocessed/     `Contains the pre-processed images`
|------Raw/     `Contains the raw, un-pre-processed, images of hieroglyphs`
|---ExampleSet7/     `An example of how the test and train set can be separated.`
|------test/     `Simply contains all pre-processed images from picture #7`
|------train/     `Contains all the hieroglyph images from the other pictures.`
|---Language Model/
|------JSESH_EgyptianTexts/     `Contains the EgyptianTexts database of JSesh, which is a program used to write hieroglyphs` [JSesh link](http://jsesh.qenherkhopeshef.org/).
|------Lexicon.txt
|------nGrams.txt

## License

GPL - non commercial use

**What are you waiting for? Make some ✨Magic ✨!**
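
The location-file entry format described in the Process section (image name, then the top-left and bottom-right corners, with a trailing comma) can be read with a few lines of Python. This is a minimal sketch based only on the single example entry quoted above; the names `GlyphBox` and `parse_location_line` are illustrative and not part of the dataset's tooling.

```python
from typing import NamedTuple


class GlyphBox(NamedTuple):
    """One location-file entry: image name plus bounding-box corners."""
    filename: str
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def parse_location_line(line: str) -> GlyphBox:
    # Entries end with a trailing comma, so empty fields are dropped.
    fields = [f.strip() for f in line.strip().split(",") if f.strip()]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    name = fields[0]
    left, top, right, bottom = (int(v) for v in fields[1:])
    return GlyphBox(name, left, top, right, bottom)


box = parse_location_line("030000_S29.png,71,27,105,104,")
print(box.filename, box.width, box.height)  # 030000_S29.png 34 77
```

The width and height reproduce the arithmetic given in the example: 105-71 = 34 and 104-27 = 77.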

Provider:

HamdiJr

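Summary of original information

Dataset overview

Dataset name

Egyptian hieroglyphs 𓂀
Hieroglyphs image dataset along with Language Model

Dataset features

Source: built from 10 pictures in the book "The Pyramid of Unas" (Alexandre Piankoff, 1955).
Picture numbers: 3, 5, 7, 9, 20, 21, 22, 23, 39, 41.
Annotation: each hieroglyph is manually annotated and labelled according to the Gardiner Sign List.
Image naming: each image name contains its label and number.

Dataset statistics

Total images: 4210 (of which 179 are labelled UNKNOWN).
Total classes: 171 (excluding the UNKNOWN class).

Annotation accuracy

Note: the labelling may not be fully accurate; hieroglyphs that could not be identified are labelled "UNKNOWN".

Data processing

Manual annotation: hieroglyphs annotated by hand.
Automatic detection: hieroglyphs are extracted automatically with a text-detection method; the results are stored in Dataset/Automated/.
Location information: the x/y position of each hieroglyph is stored in the Locations folder.

Dataset structure

Pictures: the 10 pictures from "The Pyramid of Unas".
Manual annotation: the manually annotated hieroglyph images and their location information.
Automatic detection: the automatically detected hieroglyph images and their location information.
Language model: the Egyptian texts from the JSesh database, a lexicon and nGrams.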
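
The lexicon and nGram files of the language model use semicolon-separated fields, with the hieroglyphs themselves comma-separated inside the first field. The following is a minimal parsing sketch built only from the two example lines in the dataset description (`D36,N35,D7,;an;beautiful;0.333333;` and `G17,N29,G1,;9;`). The names `LexiconEntry`, `parse_lexicon_line` and `parse_ngram_line` are hypothetical, and it assumes that no field contains a semicolon and that nGram counts are integers, neither of which the description confirms.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LexiconEntry:
    signs: Tuple[str, ...]   # Gardiner labels, in writing order
    transliteration: str
    translation: str
    occurrence: float        # word-occurrence weight from the lexicon


def _split_signs(field: str) -> Tuple[str, ...]:
    # The sign list ends with a trailing comma, so drop empty items.
    return tuple(s.strip() for s in field.split(",") if s.strip())


def parse_lexicon_line(line: str) -> LexiconEntry:
    fields = line.strip().split(";")
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields: {line!r}")
    signs, translit, translation, occurrence = fields[:4]
    return LexiconEntry(_split_signs(signs), translit, translation,
                        float(occurrence))


def parse_ngram_line(line: str) -> Tuple[Tuple[str, ...], int]:
    fields = line.strip().split(";")
    if len(fields) < 2:
        raise ValueError(f"expected at least 2 fields: {line!r}")
    # Assumption: counts are whole numbers, as in the example.
    return _split_signs(fields[0]), int(fields[1])


entry = parse_lexicon_line("D36,N35,D7,;an;beautiful;0.333333;")
print(entry.signs, entry.transliteration, entry.translation)
ngram, count = parse_ngram_line("G17,N29,G1,;9;")
print(len(ngram), count)  # 3 9
```

The length of the parsed sign tuple tells a uni-gram, bi-gram and tri-gram apart, so no separate field is needed for the nGram order.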

License

GPL: non-commercial use.

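Collected summary

Dataset introduction

Construction

In Egyptology, deciphering hieroglyphs depends on systematic sign annotation and an accumulated corpus. This dataset takes ten pictures from the book "The Pyramid of Unas" as its raw material and combines manual annotation with automatic detection. The manual annotation labels each hieroglyph according to the Gardiner Sign List. A text-detection algorithm also extracts signs automatically, and the automatically detected images are labelled by comparison with the manual annotation under the Pascal VOC overlap criterion (50% overlap). In addition, the dataset includes the text corpus of the JSesh database and the OpenGlyph lexicon, extended with word-occurrence values and n-gram counts for language modelling.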
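
The Pascal VOC overlap criterion mentioned above is usually computed as intersection over union (IoU) between two bounding boxes, with a match declared when the ratio exceeds 50%. The sketch below illustrates that computation for boxes given as (left, top, right, bottom), the corner order used in the location files. It treats coordinates as continuous edges; how the dataset authors handled pixel inclusivity or ties is not stated in the description, so those details are assumptions, and the second box in the example is made up for illustration.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def area(box: Box) -> int:
    # Degenerate or inverted boxes count as empty.
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two boxes."""
    inter = (max(a[0], b[0]), max(a[1], b[1]),
             min(a[2], b[2]), min(a[3], b[3]))
    inter_area = area(inter)
    union_area = area(a) + area(b) - inter_area
    return inter_area / union_area if union_area > 0 else 0.0


def is_match(detected: Box, manual: Box, threshold: float = 0.5) -> bool:
    # Assumption: a detection takes the manual label only if IoU > 50%.
    return iou(detected, manual) > threshold


manual = (71, 27, 105, 104)     # box from the Locations example
detected = (75, 30, 105, 104)   # hypothetical automatic detection
print(round(iou(detected, manual), 3))  # 0.848
print(is_match(detected, manual))       # True
```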

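Characteristics

The dataset contains 4210 hieroglyph images covering 171 Gardiner classes; 179 of the images could not be identified and are kept under a separate UNKNOWN label. Images are stored both raw and pre-processed, and every sign comes with its coordinates in the source picture. The language-model part combines a lexicon, a text corpus and n-gram counts, which supports mapping sign sequences to words. The dataset is split into a manually annotated branch and an automatically detected branch, and ships with an example train/test split.

Usage

The dataset can serve several tasks at the intersection of digital humanities and computational linguistics. The manually annotated images can be used directly to train sign classification or detection models, and the automatic detection results can be compared against them to evaluate a detection algorithm. The language-model files support vocabulary statistics and sequence modelling: the lexicon and n-gram files capture statistical properties of the hieroglyphic texts. Note that the labelling contains some errors and that the data is limited to non-commercial use; checking results against the original book is recommended.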
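
The example split shipped with the dataset (ExampleSet7) holds out all images from picture 7 for testing and trains on the rest. A comparable split can be sketched from file names alone. The sketch assumes, from the single documented name `030000_D35.png`, that a name is a two-digit picture number, a four-digit index, an underscore and the Gardiner label; that pattern is an inference, and the function names and two of the sample file names below are hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Assumed pattern: PPIIII_LABEL.png (picture, index, Gardiner label).
_NAME = re.compile(r"^(\d{2})(\d{4})_(.+)\.png$")


def parse_glyph_filename(name: str) -> Tuple[int, int, str]:
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return int(match.group(1)), int(match.group(2)), match.group(3)


def split_by_picture(names: Iterable[str],
                     test_picture: int = 7) -> Tuple[List[str], List[str]]:
    """Hold out every image from one picture, train on the others."""
    train: List[str] = []
    test: List[str] = []
    for name in names:
        picture, _index, _label = parse_glyph_filename(name)
        (test if picture == test_picture else train).append(name)
    return train, test


# Only the first name appears in the dataset description;
# the other two are made up for illustration.
names = ["030000_D35.png", "070012_N35.png", "410003_UNKNOWN.png"]
train, test = split_by_picture(names)
print(train)  # ['030000_D35.png', '410003_UNKNOWN.png']
print(test)   # ['070012_N35.png']
```

Splitting by picture rather than by random image keeps signs from the same page out of both sets, which gives a stricter estimate of how a classifier generalises to unseen pages.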

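Background and challenges

Background

Ancient Egyptian hieroglyphs are part of the heritage of early civilisation, and their digitisation matters both for cultural-heritage preservation and for computational linguistics. The HamdiJr/Egyptian_hieroglyphs dataset is built from the book "The Pyramid of Unas" (Alexandre Piankoff, 1955). Combining manual annotation with automatic detection, it extracts 4210 hieroglyph images from ten pictures and classifies them according to the Gardiner Sign List. Besides coordinate annotations and pre-processed images, it includes the JSesh text corpus and word-occurrence statistics, providing a data basis for automatic recognition and sequence modelling of hieroglyphs.

Current challenges

The dataset targets automatic detection and classification of ancient Egyptian hieroglyphs. The main difficulty is that sign shapes vary widely and depend on context, which makes segmentation and classification hard for conventional image-recognition models. Manual annotation requires Egyptological knowledge, and signs that could not be identified are labelled "UNKNOWN". Automatic detection is more efficient, but its output is labelled only by comparison with the manual annotation under the Pascal VOC overlap criterion, which reflects the trade-off between accuracy and efficiency that is common in digitising historical documents.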

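Common scenarios

Typical use cases

The dataset is a resource for research that combines computer vision and natural language processing on ancient Egyptian hieroglyphs. Its typical use is training deep-learning models to recognise and classify hieroglyphs: the 4210 manually annotated image samples cover 171 Gardiner Sign List classes and support building sign detection and classification systems. It is also used to evaluate convolutional neural networks on historical-script recognition tasks.

Derived and related work

Work derived from this dataset is described as centring on multimodal analysis of historical documents. Examples given include attention-based models for generating hieroglyph sequences, which draw on the images and location information, and statistical machine translation from hieroglyphs to modern languages, which draws on the bundled language model. The dataset's structure is also said to have encouraged interdisciplinary study of the "Pyramid of Unas" texts across semiotics, computational linguistics and Egyptology.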

Recent research on the dataset