OCR-Ethiopic/HHD-Ethiopic

Name: OCR-Ethiopic/HHD-Ethiopic
Creator: OCR-Ethiopic
Published: 2024-04-26 01:32:04
License: cc-by-4.0

Source: Hugging Face. Updated 2024-04-26; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/OCR-Ethiopic/HHD-Ethiopic


Resource description:

---
license: cc-by-4.0
---

## HHD-Ethiopic Dataset

This dataset, named "HHD-Ethiopic", is designed for Ethiopic text-image recognition tasks. It contains a collection of historical handwritten manuscripts in the Ethiopic script and is intended to facilitate research and development in Ethiopic text-image recognition.

### Dataset Details

- __Size__: 79,684
- __Training Set__: 57,374
- __Test Set__: HHD-Ethiopic consists of two separate test sets
  - __Test Set I (IID)__: 6,375 images (randomly drawn from the training set)
  - __Test Set II (OOD)__: 15,935 images (specifically from manuscripts dated to the 18th century)
- __Validation Set__: 10% of the training set, randomly drawn
- __Number of unique Ethiopic characters__: 306
- __Dataset Formats__: the HHD-Ethiopic dataset is stored in two formats to accommodate different use cases:
  - __Raw Image and Ground-truth Text__: consists of the original images and their corresponding ground-truth text. The dataset is structured as raw images (.png) accompanied by a [train CSV file](https://huggingface.co/datasets/OCR-Ethiopic/HHD-Ethiopic/blob/main/train/train_raw/image_text_pairs_train.csv), a [test-I CSV file](https://huggingface.co/datasets/OCR-Ethiopic/HHD-Ethiopic/blob/main/test/test_rand/image_text_pairs_test_rand.csv), and a [test-II CSV file](https://huggingface.co/datasets/OCR-Ethiopic/HHD-Ethiopic/blob/main/test/test_18th/image_text_pairs_test_18th.csv). These contain the image file names and their ground-truth text for the training set and the two test sets, respectively.
  - __Numpy Format__: both the images and the ground-truth text are stored as pre-processed numpy arrays that can be used directly for training and testing models.
- __Metadata__ (Human-Level Performance): the dataset also includes metadata on the human-level performance of individuals on the test sets. This metadata indicates the performance level that humans can be expected to achieve in historical Ethiopic text-image recognition tasks.
  - __Test Set I__: a group of 9 individuals was presented with a random subset of the dataset and asked to transcribe the handwritten texts as accurately as they could. The results are stored in a CSV file, [Test-I-human_performance](https://github.com/bdu-birhanu/HHD-Ethiopic/blob/main/Dataset/human-level-predictions/6375_new_all.csv), included in the dataset.
  - __Test Set II__: this set was prepared exclusively from Ethiopic historical handwritten documents dated to the 18th century. A different group of 4 individuals was given this subset for evaluation. The human-level performance predictions for this set are stored in a separate CSV file, [Test-II_human_performance](https://github.com/bdu-birhanu/HHD-Ethiopic/blob/main/Dataset/human-level-predictions/15935_new_all.csv).

To explore or analyze the human-level performance data further, refer to the respective CSV files. Each file contains the image file names, the ground truth, and the corresponding human-generated transcriptions.

#### Citation

If you use the HHD-Ethiopic dataset in your research, please consider citing it:

```
@misc{author_2023,
  author    = {Birhanu Hailu Belay and Isabelle Guyon and Tadele Mengiste and Bezawork Tilahun and Marcus Liwicki and Tesfa Tegegne and Romain Egele},
  title     = {HHD-Ethiopic: A Historical Handwritten Dataset for Ethiopic OCR with Baseline Models and Human-level Performance (Revision 50c1e04)},
  year      = 2023,
  url       = {https://huggingface.co/datasets/OCR-Ethiopic/HHD-Ethiopic},
  doi       = {10.57967/hf/0691},
  publisher = {Hugging Face}
}
```

#### License

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a>

This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.

Provider:

OCR-Ethiopic

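Summary of original information

HHD-Ethiopic dataset overview

Basic information

Name: HHD-Ethiopic
Purpose: Ethiopic text-image recognition tasks
Content: a collection of historical handwritten manuscripts in the Ethiopic script

Dataset details

Total size: 79,684
Training set: 57,374
Test sets:
- Test Set I (IID): 6,375 images (randomly drawn from the training set)
- Test Set II (OOD): 15,935 images (taken specifically from 18th-century manuscripts)
Validation set: 10% of the training set, randomly drawn
Number of unique Ethiopic characters: 306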
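
The split sizes listed above can be checked with a little arithmetic, and the 10% validation hold-out can be sketched as a seeded random split. This is only an illustration: the seed, the rounding of the 10% fraction, and the helper name `split_train_val` are assumptions, since the source does not say how the authors draw the validation set.

```python
import random

# Sizes as stated in the dataset details.
TOTAL = 79_684
TRAIN = 57_374
TEST_IID = 6_375
TEST_OOD = 15_935

# The three published splits add up to the stated total.
assert TRAIN + TEST_IID + TEST_OOD == TOTAL


def split_train_val(n_train, val_fraction=0.10, seed=0):
    """Return (train_indices, val_indices) for a random hold-out split.

    The seed and the truncating rounding are assumptions for illustration.
    """
    indices = list(range(n_train))
    random.Random(seed).shuffle(indices)
    n_val = int(n_train * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_train_val(TRAIN)
print(len(train_idx), len(val_idx))
```

Fixing the seed keeps the validation set reproducible across runs, which matters when comparing models trained on the same data.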

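Dataset formats

Raw images and ground-truth text: the original images (.png) with their corresponding ground-truth text, provided through CSV files for the training set and the two test sets.
Numpy format: images and ground-truth text stored as numpy arrays that can be used directly for model training and testing.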
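
As a rough sketch of how the raw-format CSV files might be consumed, the snippet below reads image file names and ground-truth text as pairs. The column layout is an assumption: the source only says the files contain image file names and their ground-truth text, so the reader takes the first two columns in that order and optionally skips a header row. The sample rows and header names are made up for illustration.

```python
import csv
import io


def read_image_text_pairs(csv_file, has_header=True):
    """Yield (image_filename, ground_truth_text) pairs from an open CSV file.

    Assumes column 0 is the image file name and column 1 the ground-truth
    text; the real files may be laid out differently.
    """
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)  # skip the header row if there is one
    for row in reader:
        if len(row) >= 2:  # ignore blank or malformed rows
            yield row[0], row[1]


# Hypothetical in-memory sample standing in for a downloaded CSV file.
sample = io.StringIO(
    "image_filename,line_text\n"
    "img_0001.png,ሰላም\n"
    "img_0002.png,ዓለም\n"
)
pairs = list(read_image_text_pairs(sample))
print(pairs)
```

For a file on disk, open it with `open(path, newline="", encoding="utf-8")` so that the Ethiopic text is decoded correctly and the `csv` module handles line endings itself.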

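Metadata (human-level performance)

Test Set I: 9 people took part in the evaluation; the results are stored in a CSV file.
Test Set II: 4 people evaluated the 18th-century manuscripts; the results are stored in a separate CSV file.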
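
One way to turn the human transcriptions into a single score is the character error rate (CER), a common OCR metric. The source does not state which metric the authors use, so treating CER as the measure here is an assumption, as are the function names. The sketch computes Levenshtein distance per line and normalizes by the total number of ground-truth characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(references, hypotheses):
    """Total edits divided by total reference characters over all lines."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0


# Made-up example: one substituted character out of three.
print(character_error_rate(["ሰላም"], ["ሰላን"]))
```

Each Ethiopic syllable character is a single Unicode code point, so counting code points lines up with counting characters for this script.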

Citation

Authors: Birhanu Hailu Belay, Isabelle Guyon, Tadele Mengiste, Bezawork Tilahun, Marcus Liwicki, Tesfa Tegegne, Romain Egele
Title: HHD-Ethiopic: A Historical Handwritten Dataset for Ethiopic OCR with Baseline Models and Human-level Performance
Year: 2023
Publisher: Hugging Face

License

Type: Creative Commons Attribution 4.0 International License
