
BLN600: A Parallel Corpus of Machine/Human Transcribed Nineteenth Century Newspaper Texts

Source: DataCite Commons. Updated 2024-09-24. Indexed 2024-07-13.
Download link:
https://orda.shef.ac.uk/articles/dataset/BLN600_A_Parallel_Corpus_of_Machine_Human_Transcribed_Nineteenth_Century_Newspaper_Texts/25439023/1
Description:
BLN600

Human transcriptions of 600 images selected from the British Library Newspapers parts 1 & 2 dataset.

To discourage automated web-scraping for AI training purposes, BLN600 is released as a password protected ZIP archive. The password is BLN600.

Directory Layout

Images/ - cropped images downloaded from the BLN platform in mixed JPG and TIFF format - indexed by GALE document ID
Ground Truth/ - human transcriptions of the images - indexed by GALE document ID to match up with images
OCR Text/ - GALE's OCR transcriptions of the images - indexed by GALE document ID to match up with images
metadata.json - structured document data linking document ID with publication information, article count, and non crime counts

Document IDs

Documents within the GALE BLN system are indexed by a "document ID": a 10 digit number, prefixed with a 1 or 2 letter collection ID. Two versions of this ID appear to exist: a short form without the collection ID, which is the form in which data has been returned from GALE, and a longer form with the collection ID, which must be used when searching the platform site. For example, the 1834-07-07 issue of the Morning Chronicle has been returned from GALE as document ID 3207163457; to find this document again in the BLN online platform, you will need the longer form BA3207163457. The two types are bridged in metadata.json.

Errors

Care has been taken to ensure the ground truth is high quality through multiple error detection, image comparison, and error correction passes; however, errors may still remain. BLN600 claims a high, but not 100%, accuracy rate. If you have noticed an error in the ground truth, please report it to one of the authors.
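
The document ID convention and directory layout described above can be sketched in Python as follows. This is only an illustrative sketch, not tooling shipped with BLN600. It assumes that the collection prefix consists of uppercase letters (as in the BA example), that each file in Ground Truth/ and OCR Text/ is named with a document ID plus an extension, and that either ID form may appear in file names. The function names split_document_id, short_form, and pair_transcriptions are hypothetical, and the structure of metadata.json is not assumed here.

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumption: an optional 1-2 uppercase-letter collection ID, then 10 digits.
_DOC_ID = re.compile(r"^([A-Z]{1,2})?(\d{10})$")


def split_document_id(doc_id: str) -> tuple[str | None, str]:
    """Split an ID into (collection prefix or None, 10-digit number)."""
    match = _DOC_ID.match(doc_id.strip())
    if match is None:
        raise ValueError(f"not a recognised document ID: {doc_id!r}")
    return match.group(1), match.group(2)


def short_form(doc_id: str) -> str:
    """Return the 10-digit short form of either ID variant."""
    return split_document_id(doc_id)[1]


def _index_by_id(directory: Path) -> dict[str, Path]:
    """Map short-form IDs to files, skipping names that are not IDs."""
    index: dict[str, Path] = {}
    for path in directory.iterdir():
        if not path.is_file():
            continue
        try:
            index[short_form(path.stem)] = path
        except ValueError:
            continue  # e.g. a stray README or hidden file
    return index


def pair_transcriptions(root: Path) -> dict[str, tuple[Path, Path]]:
    """Pair each human transcription with the OCR text for the same ID."""
    ground_truth = _index_by_id(root / "Ground Truth")
    ocr_text = _index_by_id(root / "OCR Text")
    shared = sorted(ground_truth.keys() & ocr_text.keys())
    return {doc_id: (ground_truth[doc_id], ocr_text[doc_id]) for doc_id in shared}


if __name__ == "__main__":
    print(split_document_id("BA3207163457"))
```

Normalising every name to the short form before matching means the pairing works whichever ID variant the file names use. Recovering the long form from a short one still requires the mapping in metadata.json, because the collection prefix cannot be derived from the digits alone.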
Access, usage, and modification terms and license

Express permission was sought from and granted by GALE on behalf of the company and the British Library partners, and communicated to the authors electronically, for the release of the OCR text of 600 individual excerpts from the British Library Newspapers corpus parts 1 and 2, under a non-commercial use-only license (CC BY-NC-ND 4.0), publicly accessible with no additional access stipulations.

This research was funded by a UKRI EPSRC PhD studentship (1st author) and by The University of Sheffield's Centre for Machine Intelligence (2nd author).

BLN600 by Callum Booth, Alan Thomas, and Robert Gaizauskas is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/
Provided by:
The University of Sheffield
Created:
2024-05-23