Arabic OCR Corpus (2,894 items from QNL Collection)
Source: DataCite Commons. Updated 2025-04-01; indexed 2025-04-16.
Download link:
https://manara.qnl.qa/articles/dataset/Arabic_OCR_Corpus_2_894_items_from_QNL_Collection_/26984785/1
Description:
Dataset contents

This dataset is an OCR text corpus of 2,894 printed works (monographs and serials) from the collection of the Qatar National Library. The works are predominantly in Arabic, although fragments of text in other languages also occur. Basic descriptive metadata is provided for each item alongside the OCR text.

Dataset structure

The dataset consists of three files.

QNL-ArabicContentDataset-Metadata.csv and QNL-ArabicContentDataset-Metadata.xlsx contain the same basic metadata for the 2,894 items from the Qatar National Library collection. Both files are structured into the following columns:

CALL #(ITEM) - Item call number in the QNL catalog
RECORD #(ITEM) - Item record number in the QNL catalog (unique for each item)
Repository URL - URL to the digitized item content in the QNL repository
Catalog URL - URL to the complete item metadata record in the QNL catalog
AUTHOR - Main author information for the item
ADD AUTHOR - Additional author information for the item
PUB INFO - Item publication information
TITLE - Item title
DESCRIPTION - Item description
VOLUME - Item volume information (for some serial publications)

QNL_ArabicOCR_Corpus.zip contains:

2,894 text files named according to the pattern [unique item record number]-[unique item QNL repository id].txt. Use the unique item record number to match each file with its metadata record. Each file contains the text extracted from one item using OCR software.
checksums.sha256 - SHA256 checksums for all 2,894 text files
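The matching rule described above (the record number at the start of each file name corresponds to the RECORD #(ITEM) column) can be sketched in Python as follows. This is a minimal illustration and not an official loader. It assumes that record numbers contain no hyphen, so the file name can be split at the first hyphen. The function names are hypothetical, and the sample record number, repository id, and title are placeholders, not values from the dataset. The layout of checksums.sha256 is not documented in the description, so the checksum helper takes the expected digest directly instead of parsing that file.

```python
import csv
import hashlib
import io
from pathlib import PurePosixPath


def split_corpus_name(file_name: str) -> tuple[str, str]:
    """Split '<record number>-<repository id>.txt' into its two parts.

    Assumption: the record number itself contains no hyphen.
    """
    stem = PurePosixPath(file_name).stem
    record_number, sep, repository_id = stem.partition("-")
    if not sep or not record_number or not repository_id:
        raise ValueError(f"unexpected file name: {file_name!r}")
    return record_number, repository_id


def index_metadata(csv_text: str) -> dict[str, dict[str, str]]:
    """Index metadata rows by their 'RECORD #(ITEM)' value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["RECORD #(ITEM)"].strip(): row for row in reader}


def sha256_matches(data: bytes, expected_hex: str) -> bool:
    """Return True if the SHA256 digest of data equals expected_hex."""
    return hashlib.sha256(data).hexdigest() == expected_hex.strip().lower()


if __name__ == "__main__":
    # Placeholder metadata; real rows come from
    # QNL-ArabicContentDataset-Metadata.csv.
    sample_csv = "RECORD #(ITEM),TITLE\ni0000001,Placeholder title\n"
    metadata = index_metadata(sample_csv)

    # Placeholder file name following the documented naming pattern.
    record_number, repository_id = split_corpus_name("i0000001-12345.txt")
    print(record_number, repository_id, metadata[record_number]["TITLE"])
```

When reading the real CSV from disk, opening it with `newline=""` and an explicit encoding (UTF-8 is an assumption here) and passing the file object to `csv.DictReader` avoids loading the text into memory first.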
Provider:
Manara - Qatar Research Repository
Created:
2024-09-12



