
OCR-Ethiopic/HHD-Ethiopic

Source: Hugging Face. Updated 2024-04-26; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/OCR-Ethiopic/HHD-Ethiopic
Resource description:
---
license: cc-by-4.0
---

## HHD-Ethiopic Dataset

This dataset, named "HHD-Ethiopic," is designed for Ethiopic text-image recognition tasks. It contains a collection of historical handwritten manuscripts in the Ethiopic script and is intended to facilitate research and development in Ethiopic text-image recognition.

### Dataset Details

- __Size__: 79,684
- __Training Set__: 57,374
- __Test Set__: HHD-Ethiopic has two separate test sets
  - __Test Set I (IID)__: 6,375 images (randomly drawn from the training set)
  - __Test Set II (OOD)__: 15,935 images (specifically from manuscripts dated in the 18th century)
- __Validation Set__: 10% of the training set, randomly drawn
- __Number of unique Ethiopic characters__: 306
- __Dataset Formats__: the HHD-Ethiopic dataset is stored in two formats to accommodate different use cases:
  - __Raw Image and Ground-truth Text__: consists of the original images and their corresponding ground-truth text. The raw images (.png) are accompanied by a [train CSV file](https://huggingface.co/datasets/OCR-Ethiopic/HHD-Ethiopic/blob/main/train/train_raw/image_text_pairs_train.csv), a [test-I CSV file](https://huggingface.co/datasets/OCR-Ethiopic/HHD-Ethiopic/blob/main/test/test_rand/image_text_pairs_test_rand.csv), and a [test-II CSV file](https://huggingface.co/datasets/OCR-Ethiopic/HHD-Ethiopic/blob/main/test/test_18th/image_text_pairs_test_18th.csv), which contain the image file names and their ground-truth text for the training set and the two test sets, respectively.
  - __Numpy Format__: both the images and the ground-truth text are stored as pre-processed numpy arrays that can be used directly for training and testing models.
- __Metadata__ (Human-Level Performance): the dataset also includes metadata on the human-level performance recorded for the test sets. This metadata indicates the level of performance that humans can be expected to achieve in historical Ethiopic text-image recognition tasks.
  - __Test Set I__: a group of 9 individuals was presented with a random subset of the dataset and asked to transcribe the handwritten texts as accurately as they could. The results are stored in a CSV file, [Test-I-human_performance](https://github.com/bdu-birhanu/HHD-Ethiopic/blob/main/Dataset/human-level-predictions/6375_new_all.csv), included in the dataset.
  - __Test Set II__: Test Set II was prepared exclusively from Ethiopic historical handwritten documents dated in the 18th century. A different group of 4 individuals was given this subset for evaluation. The human-level performance predictions for this set are stored in a separate CSV file, [Test-II_human_performance](https://github.com/bdu-birhanu/HHD-Ethiopic/blob/main/Dataset/human-level-predictions/15935_new_all.csv).

Each CSV file contains the image filenames, the ground truth, and the corresponding human-generated transcriptions. Refer to these files for detailed information or for further analysis of the human-level performance.
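The human-level performance files pair each ground-truth string with a human transcription, so they can be scored with any string-distance metric. The source does not say which metric was used. The sketch below assumes character error rate (CER), a common choice for OCR, computed as total edit distance divided by total reference length. The function names and the two sample strings are illustrative only and are not taken from the dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(references, hypotheses) -> float:
    """Total edit distance over total reference length (assumed metric)."""
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars if total_chars else 0.0


# Hypothetical ground truth and human transcriptions, for illustration only.
ground_truth = ["ሰላም", "ኢትዮጵያ"]
human_output = ["ሰላም", "ኢትዮጵ"]  # last character of the second word is missing
cer = character_error_rate(ground_truth, human_output)
```

To score the real files, the two lists would be filled from the ground-truth and transcription columns of the CSV. The column names are not given here, so check the file header first.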
#### Citation

If you use the HHD-Ethiopic dataset in your research, please consider citing it:

```
@misc{author_2023,
  author    = { {Birhanu Hailu Belay, Isabelle Guyon, Tadele Mengiste, Bezawork Tilahun, Marcus Liwicki, Tesfa Tegegne, and Romain Egele} },
  title     = { HHD-Ethiopic: A Historical Handwritten Dataset for Ethiopic OCR with Baseline Models and Human-level Performance (Revision 50c1e04) },
  year      = 2023,
  url       = { https://huggingface.co/datasets/OCR-Ethiopic/HHD-Ethiopic },
  doi       = { 10.57967/hf/0691 },
  publisher = { Hugging Face }
}
```

#### License

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.
Provider:
OCR-Ethiopic
Summary of Original Information

HHD-Ethiopic Dataset Overview

Basic Dataset Information

  • Name: HHD-Ethiopic
  • Purpose: Ethiopic text-image recognition tasks
  • Content: a collection of historical handwritten manuscripts in the Ethiopic script

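Dataset Details

  • Total size: 79,684
  • Training set: 57,374
  • Test sets
    • Test Set I (IID): 6,375 images (randomly drawn from the training set)
    • Test Set II (OOD): 15,935 images (specifically from 18th-century manuscripts)
  • Validation set: 10% of the training set, randomly drawn
  • Number of unique Ethiopic characters: 306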
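The split sizes above add up to the stated total, and the validation set is described only as a random 10% of the training set. The sketch below checks the arithmetic and draws one such split over sample indices. The seed, the rounding-down of the validation size, and the function name are assumptions for illustration, not the authors' procedure.

```python
import random

# Sizes as stated in the dataset details.
TRAIN_SIZE = 57_374
TEST_IID_SIZE = 6_375
TEST_OOD_SIZE = 15_935
TOTAL_SIZE = 79_684


def split_train_validation(n_samples, val_fraction=0.10, seed=0):
    """Shuffle indices and hold out a fraction for validation.

    The seed and the floor rounding are illustrative assumptions.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_train_validation(TRAIN_SIZE)
sizes_match = TRAIN_SIZE + TEST_IID_SIZE + TEST_OOD_SIZE == TOTAL_SIZE
```

With these assumptions the validation set holds 5,737 samples and 51,637 remain for training. A different rounding rule would shift that count by one.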

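Dataset Formats

  • Raw images and ground-truth text: the original images (.png) with their corresponding ground-truth text, provided through CSV files for the training set and the two test sets.
  • Numpy format: images and ground-truth text stored as Numpy arrays that can be used directly for model training and testing.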
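In the raw format, each CSV row pairs an image file name with its ground-truth text. Below is a minimal sketch of reading such pairs with Python's standard `csv` module. It assumes the file name is in the first column, the text is in the second, and a header row is present. None of this is confirmed here, and the sample rows and column names are hypothetical.

```python
import csv
import io


def read_image_text_pairs(handle, has_header=True):
    """Return (image_filename, ground_truth_text) tuples from a CSV handle.

    Assumes column 0 is the file name and column 1 is the text.
    """
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row if there is one
    pairs = []
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        pairs.append((row[0], row[1]))
    return pairs


# Hypothetical in-memory sample standing in for one of the CSV files.
sample = io.StringIO(
    "image_filename,text\n"
    "img_0001.png,ሰላም\n"
    "img_0002.png,ኢትዮጵያ\n"
)
pairs = read_image_text_pairs(sample)
```

For a downloaded file, pass `open(path, encoding="utf-8", newline="")` as the handle so the Ethiopic text decodes correctly.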

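Metadata (Human-Level Performance)

  • Test Set I: 9 individuals took part in the evaluation; the results are stored in a CSV file.
  • Test Set II: 4 individuals evaluated the 18th-century manuscripts; the results are stored in a separate CSV file.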

Citation Information

  • Authors: Birhanu Hailu Belay, Isabelle Guyon, Tadele Mengiste, Bezawork Tilahun, Marcus Liwicki, Tesfa Tegegne, Romain Egele
  • Title: HHD-Ethiopic: A Historical Handwritten Dataset for Ethiopic OCR with Baseline Models and Human-level Performance
  • Year: 2023
  • Publisher: Hugging Face

License

  • Type: Creative Commons Attribution 4.0 International License