Benchmark for the evaluation of named entity recognition over ancient documents

Source: NIAID Data Ecosystem, indexed 2026-03-13
Download link:
https://zenodo.org/record/3877553
Description:
The dataset consists of multilingual noisy corpora for named entity recognition (NER). The noisy versions are simulated from the CoNLL-02 (Spanish and Dutch) and CoNLL-03 (English) NER corpora. The original collections were re-OCRed, and four types of noise at two different levels were added in order to simulate a range of OCR output.

More precisely, we first extracted the raw texts and converted them into images. These images were then contaminated with noise of the kind commonly introduced by a scanner. We then extracted OCRed data using the open-source Tesseract OCR engine, version 3.04.01. As a result of the inserted image noise, the OCRed data contains degradations. Finally, the original and noisy texts were aligned.

This archive contains three folders (one per language). Each folder contains the degraded images, the noisy texts extracted by the OCR, and their version aligned with the clean data.

These are the supplementary materials for the TPDL 2020 paper "Assessing and Minimizing the Impact of OCR Quality on Named Entity Recognition". If you use all or part of this resource, please cite this paper:

@InProceedings{10.1007/978-3-030-54956-5_7,
  author="Hamdi, Ahmed and Jean-Caurant, Axel and Sid{\`e}re, Nicolas and Coustaty, Micka{\"e}l and Doucet, Antoine",
  editor="Hall, Mark and Mer{\v{c}}un, Tanja and Risse, Thomas and Duchateau, Fabien",
  title="Assessing and Minimizing the Impact of OCR Quality on Named Entity Recognition",
  booktitle="Digital Libraries for Open Knowledge",
  year="2020",
  publisher="Springer International Publishing",
  address="Cham",
  pages="87--101",
  isbn="978-3-030-54956-5"
}

Acknowledgments

This work has been supported by the European Union's Horizon 2020 research and innovation programme under grant 770299 [NewsEye](https://www.newseye.eu/).
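The description above mentions that clean and OCRed texts are aligned, but it does not say how. As a rough illustration only, the sketch below shows one way a clean string and its noisy counterpart could be aligned at character level and scored with a character error rate. The function names `align` and `char_error_rate`, the `@` gap symbol, and the use of Python's `difflib` are all assumptions made for this example; they are not the authors' tooling and do not reflect the file format in the archive.

```python
from difflib import SequenceMatcher


def align(clean: str, noisy: str, gap: str = "@") -> tuple[str, str]:
    """Pad both strings with `gap` so matching spans line up column by column.

    Hypothetical helper: the gap symbol must not occur in either input.
    """
    out_clean, out_noisy = [], []
    matcher = SequenceMatcher(None, clean, noisy, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        a, b = clean[i1:i2], noisy[j1:j2]
        width = max(len(a), len(b))
        # Pad the shorter side so both spans occupy the same number of columns.
        out_clean.append(a.ljust(width, gap))
        out_noisy.append(b.ljust(width, gap))
    return "".join(out_clean), "".join(out_noisy)


def char_error_rate(clean: str, noisy: str) -> float:
    """Levenshtein distance between the texts divided by the clean length."""
    prev = list(range(len(noisy) + 1))
    for i, c in enumerate(clean, start=1):
        curr = [i]
        for j, n in enumerate(noisy, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (c != n)))  # substitution or match
        prev = curr
    return prev[-1] / max(len(clean), 1)


if __name__ == "__main__":
    a, b = align("Germany", "Gerrnany")  # "m" misread as "rn", a typical OCR error
    print(a)
    print(b)
    print(round(char_error_rate("Germany", "Gerrnany"), 3))
```

Padding both sides to the same length keeps token offsets comparable, which is what makes it possible to carry NER labels from the clean text over to the noisy one.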
Created:
2022-02-21