ReadingTimeMachine/rtm-sgt-ocr-v1
Dataset Introduction
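This dataset contains over 1.5 million synthetically generated ground truth/OCR (Optical Character Recognition) pairs for post-correction of OCR output from historical scientific articles. The data come from our paper "Large Synthetic Data from the arχiv for OCR Post Correction of Historic Scientific Articles".
Synthetic ground truth (SGT) sentences are mined from arχiv Bulk Downloads source documents, while the OCR sentences are generated by running the Tesseract OCR engine on the PDF pages produced by compiling those source documents.
The SGT/OCR pairs come from astronomy articles published between 1991 and 2011.
No page augmentation was applied to any of the PDF documents (i.e., the pages are "clean", with no warping, dust, etc.).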
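The paper's abstract reports results as character and word error rates. As a rough illustration of what such a comparison between an SGT sentence and its OCR counterpart might look like, here is a minimal sketch that assumes plain Levenshtein edit distance normalized by the ground-truth length. This is an assumption for illustration, not necessarily the exact metric used in the paper; the function names and the example sentence are hypothetical and not taken from the dataset.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(
                prev[j] + 1,         # delete item_a
                curr[j - 1] + 1,     # insert item_b
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


def char_error_rate(ground_truth: str, ocr: str) -> float:
    """Character-level edit distance divided by the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must be non-empty")
    return edit_distance(ground_truth, ocr) / len(ground_truth)


def word_error_rate(ground_truth: str, ocr: str) -> float:
    """Word-level edit distance divided by the number of ground-truth words."""
    gt_words = ground_truth.split()
    if not gt_words:
        raise ValueError("ground truth must contain at least one word")
    return edit_distance(gt_words, ocr.split()) / len(gt_words)


if __name__ == "__main__":
    # Made-up pair for illustration only (not drawn from the dataset).
    sgt = "The galaxy rotation curve"
    ocr = "The ga1axy rotatlon curve"
    print(f"CER: {char_error_rate(sgt, ocr):.2f}")
    print(f"WER: {word_error_rate(sgt, ocr):.2f}")
```

Normalizing by the ground-truth length means the rate can exceed 1 when the OCR output contains many spurious insertions, so other normalizations are sometimes preferred.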
Dataset Versions
- V0 (released with the original paper) is available here: link
Citation
If you use this dataset, please cite the following:
@inproceedings{10.1007/978-3-031-43849-3_23,
  author = {Naiman, J. P. and Cosillo, Morgan G. and Williams, Peter K. G. and Goodman, Alyssa},
  title = {Large Synthetic Data From the arχiv For OCR Post Correction Of Historic Scientific Articles},
  year = {2023},
  isbn = {978-3-031-43848-6},
  publisher = {Springer-Verlag},
  address = {Berlin, Heidelberg},
  url = {https://doi.org/10.1007/978-3-031-43849-3_23},
  doi = {10.1007/978-3-031-43849-3_23},
  abstract = {Historical scientific articles often require Optical Character Recognition (OCR) to transform scanned documents into machine-readable text, a process that often produces errors. We present a pipeline for the generation of a synthetic ground truth/OCR dataset to correct the OCR results of the astrophysics literature holdings of the NASA Astrophysics Data System (ADS). By mining the arχiv we create, to the authors’ knowledge, the largest scientific synthetic ground truth/OCR post correction dataset of 203,354,393 character pairs. Baseline models trained with this dataset find the mean improvement in character and word error rates of 7.71% and 18.82% for historical OCR text, respectively. Interactive dashboards to explore the dataset are available online: , and data and code, are hosted on GitHub: .},
  booktitle = {Linking Theory and Practice of Digital Libraries: 27th International Conference on Theory and Practice of Digital Libraries, TPDL 2023, Zadar, Croatia, September 26–29, 2023, Proceedings},
  pages = {265–274},
  numpages = {10},
  keywords = {scholarly document processing, optical character recognition, astronomy},
  location = {Zadar, Croatia}
}



