
Transcribed newspaper articles from the NCSE collection

Source: DataCite Commons. Updated: 2025-01-02. Indexed: 2025-04-17.
Download link:
https://rdr.ucl.ac.uk/articles/dataset/Transcribed_newspaper_articles_from_the_NCSE_collection/25805008
Description:
CLOCR-C: Transcribed newspaper articles from the NCSE collection

This dataset contains 91 pairs of newspaper articles from the Nineteenth Century Serials Edition (NCSE). Each pair consists of the original OCR text from the NCSE and its transcribed equivalent. The data was used in "CLOCR-C: Context Leveraging OCR Correction with Pre-trained Language Models" to demonstrate that pre-trained language models can perform post-OCR correction and improve the accuracy of corrupted OCR text. The paper can be found on arXiv at https://arxiv.org/abs/2408.17428

Data Details

The dataset comes from 6 different publications and is made up of 91 articles containing a total of 40712 words, distributed across the 19th century.

The dataset is a zip file made up of two sub-folders, each containing 91 files. Each file shares its name with a corresponding file in the other folder.

transcription_files: contains .txt files of the transcribed articles
transcription_raw_ocr: contains .txt files of the original OCR
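
Because each file shares its name with its counterpart in the other folder, the OCR/transcription pairs can be rebuilt by matching file names. The following is a minimal sketch, not part of the dataset itself. It assumes the zip has already been extracted to a local directory, that the two folders sit directly under that directory, and that the files are UTF-8 encoded; the function name `load_pairs` and the example path are hypothetical.

```python
from pathlib import Path


def load_pairs(root):
    """Return {file name: (raw OCR text, transcribed text)} for matching files.

    Assumes `root` holds the two folders named in the dataset description
    and that the .txt files are UTF-8 encoded (an assumption, not stated
    in the description).
    """
    root = Path(root)
    transcribed_dir = root / "transcription_files"
    raw_ocr_dir = root / "transcription_raw_ocr"

    pairs = {}
    for transcribed_path in sorted(transcribed_dir.glob("*.txt")):
        # The counterpart has the same file name in the other folder.
        raw_path = raw_ocr_dir / transcribed_path.name
        if not raw_path.exists():
            continue  # skip any file without a counterpart
        pairs[transcribed_path.name] = (
            raw_path.read_text(encoding="utf-8"),
            transcribed_path.read_text(encoding="utf-8"),
        )
    return pairs
```

Calling `load_pairs("ncse_dataset")`, where `ncse_dataset` is a hypothetical extraction directory, would return one entry per article, with the raw OCR first and the transcription second.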
Provided by:
University College London
Created:
2024-05-12