Muharaf

Source: arXiv. Updated: 2024-06-14. Indexed: 2024-08-06.
Download link:
http://arxiv.org/abs/2406.09630v1
Description:
Muharaf is a machine learning dataset of 1,644 historical handwritten Arabic page images, created by North Carolina State University and collaborating institutions. The documents date from the early 19th century to the early 21st century and include personal letters, diaries, poems, church records, and legal correspondence. Experts annotated and transcribed the text lines in each document image, with deep learning models assisting through text predictions that were then manually corrected. Beyond handwritten text recognition, the dataset supports related tasks such as text line segmentation, layout analysis, and author identification, with the aim of making historical documents more accessible for research.
Provider:
North Carolina State University
Created:
2024-06-14