
Arabic OCR Corpus (2,894 items from QNL Collection)

Source: DataCite Commons. Updated 2025-04-01; indexed 2025-04-16.
Download link:
https://manara.qnl.qa/articles/dataset/Arabic_OCR_Corpus_2_894_items_from_QNL_Collection_/26984785/1
Description:

Dataset contents

This dataset is an OCR text corpus of 2,894 printed works (monographs and serials) from the collection of the Qatar National Library (QNL). The works are mostly in Arabic, although fragments of text in other languages also occur. Basic descriptive metadata is provided for each item alongside its OCR text.

Dataset structure

The dataset consists of three files.

QNL-ArabicContentDataset-Metadata.csv and QNL-ArabicContentDataset-Metadata.xlsx contain the same basic metadata for the 2,894 items from the Qatar National Library collection. Both files are structured into the following columns:

CALL #(ITEM) - Item call number in the QNL catalog
RECORD #(ITEM) - Item record number in the QNL catalog (unique for each item)
Repository URL - URL to the digitized item content in the QNL repository
Catalog URL - URL to the complete item metadata record in the QNL catalog
AUTHOR - Main author information for the item
ADD AUTHOR - Additional author information for the item
PUB INFO - Item publication information
TITLE - Item title
DESCRIPTION - Item description
VOLUME - Item volume information (for some serial publications)

QNL_ArabicOCR_Corpus.zip contains:

2,894 text files named according to the pattern [unique item record number]-[unique item QNL repository id].txt. Use the unique item record number to match each file with its metadata record. Each file contains the text extracted from one item using OCR software.
checksums.sha256 - SHA256 checksums for all 2,894 text files
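The matching rule described above can be sketched in Python. This is a minimal illustration, not part of the dataset: it assumes the record number itself contains no hyphen (so it is everything before the first hyphen in the file name), that the CSV header is spelled exactly `RECORD #(ITEM)`, and the helper names and the sample record number in the usage note are hypothetical.

```python
import csv
import hashlib
import io
from pathlib import Path


def record_number_from_filename(filename: str) -> str:
    """Return the record number, assumed to be the text before the first hyphen."""
    stem = Path(filename).stem  # drop the ".txt" extension
    record_number, sep, _repository_id = stem.partition("-")
    if not sep or not record_number:
        raise ValueError(f"unexpected file name: {filename!r}")
    return record_number


def index_metadata(csv_text: str) -> dict:
    """Map each RECORD #(ITEM) value to its metadata row (a dict of columns)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["RECORD #(ITEM)"].strip(): row for row in reader}


def sha256_of(data: bytes) -> str:
    """Hex SHA256 digest of a file's bytes, for comparison with checksums.sha256."""
    return hashlib.sha256(data).hexdigest()
```

For example, with a made-up file name `i0000001-12345.txt`, `record_number_from_filename` yields `i0000001`, which can then be looked up in the dictionary returned by `index_metadata`. The layout of checksums.sha256 is not documented here, so comparing `sha256_of(path.read_bytes())` against its entries requires inspecting that file first.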
Provider:
Manara - Qatar Research Repository
Created:
2024-09-12