Arabic OCR Corpus v.2 (2,894 items from QNL Collection)
Source: DataCite Commons. Updated: 2024-11-12. Indexed: 2025-04-16.
Download link:
https://manara.qnl.qa/articles/dataset/Arabic_OCR_Corpus_2_894_items_from_QNL_Collection_/26984785
Resource description:
Dataset contents

This dataset is an OCR text corpus of 2,894 printed works (monographs and serials) from the collection of the Qatar National Library. The works are mostly in Arabic, but fragments of text in other languages also occur. Besides the OCR text, basic descriptive metadata is provided for each item.

Release note for version 2 of the dataset

The dataset of OCRed Arabic books has been fully updated to ensure consistency and quality. All items in the dataset have now been processed using the latest retrained data. In addition, every item has undergone a visual quality assurance check on a representative sample of its pages. The update has significantly improved word-level accuracy across the entire dataset, making it more reliable and usable.

The exact list of files changed between version 1 and version 2 of the dataset can be determined by comparing the SHA256 checksums provided with each dataset version (see below for details).

Dataset structure

The dataset consists of three files:

QNL-ArabicContentDataset-Metadata.csv and QNL-ArabicContentDataset-Metadata.xlsx contain the same basic metadata for the 2,894 items from the Qatar National Library collection. Both files are structured into the following columns:

CALL #(ITEM) - Item call number in the QNL catalog
RECORD #(ITEM) - Item record number in the QNL catalog (unique for each item)
Repository URL - URL to the digitized item content in the QNL repository
Catalog URL - URL to the complete item metadata record in the QNL catalog
AUTHOR - Main author information for the item
ADD AUTHOR - Additional author information for the item
PUB INFO - Item publication info
TITLE - Item title
DESCRIPTION - Item description
VOLUME - Item volume information (for some serial publications)

QNL_ArabicOCR_Corpus-v2.zip contains:

2,894 text files named according to the pattern [unique item record number]-[unique item QNL repository id].txt. The unique item record number should be used to match each file with its metadata record. Each file contains the text extracted from one item using OCR software.
checksums.sha256 - contains SHA256 checksums for all 2,894 text files
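The matching and checksum steps described above can be sketched in Python. This is a minimal illustration, not part of the dataset: it assumes that checksums.sha256 uses the common sha256sum layout (hex digest, whitespace, file name), that record numbers contain no hyphen, that the CSV is UTF-8 encoded, and that the record number appears in the file names exactly as in the RECORD #(ITEM) column. All function names are hypothetical.

```python
import csv
import hashlib
from pathlib import Path


def record_number_from_filename(name: str) -> str:
    """Return the part of the file name before the first hyphen.

    Assumption: the record number itself contains no hyphen.
    """
    return Path(name).stem.split("-", 1)[0]


def parse_checksums(text: str) -> dict[str, str]:
    """Parse sha256sum-style lines ("<hex digest>  <file name>") into {name: digest}."""
    checksums = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        digest, name = line.split(maxsplit=1)
        # sha256sum marks binary mode with a leading "*" on the file name.
        checksums[name.lstrip("*")] = digest.lower()
    return checksums


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large OCR files are not read into memory at once."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(corpus_dir: Path, checksums: dict[str, str]) -> list[str]:
    """Return the names of files that are missing or whose digest does not match."""
    bad = []
    for name, expected in checksums.items():
        path = corpus_dir / name
        if not path.is_file() or sha256_of(path) != expected:
            bad.append(name)
    return bad


def load_metadata(csv_path: Path) -> dict[str, dict[str, str]]:
    """Index metadata rows by the RECORD #(ITEM) column (encoding is an assumption)."""
    with csv_path.open(newline="", encoding="utf-8-sig") as f:
        return {row["RECORD #(ITEM)"].strip(): row for row in csv.DictReader(f)}
```

Under the same assumptions, the files that changed between version 1 and version 2 could be listed by calling parse_checksums on each version's checksums.sha256 and comparing the two resulting dictionaries by file name.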
Provider:
Manara - Qatar Research Repository
Created:
2024-09-11



