CoAID dataset texts with OCR degradations
Source: NIAID Data Ecosystem. Indexed: 2026-03-13
Download link:
https://zenodo.org/record/6630709
Resource description:
This record contains the text of the CoAID dataset, which was originally built for fake news detection and has been adapted for use in event detection.
Cui, Limeng, and Dongwon Lee. 2020. "CoAID: COVID-19 Healthcare Misinformation Dataset". arXiv:2006.00885 [cs], November. http://arxiv.org/abs/2006.00885.
Guillaume Bernard. (2022). CoAID dataset with multiple extracted features (both sparse and dense) (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6630405
Degradations are applied to the text of the tweets with the DocCreator tool [1] in order to reproduce common errors found in OCRed documents [2].
[1]: Journet, Nicholas, Muriel Visani, Boris Mansencal, Kieu Van-Cuong, and Antoine Billy. 2017. "DocCreator: A New Software for Creating Synthetic Ground-Truthed Document Images". Journal of Imaging 3 (4): 62. https://doi.org/10.3390/jimaging3040062.
[2]: Linhares Pontes, Elvys, Ahmed Hamdi, Nicolas Sidere, and Antoine Doucet. 2019. "Impact of OCR Quality on Named Entity Linking". In Digital Libraries at the Crossroads of Digital Information for the Future, 11853:102-15. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-34058-2_11.
The results of the OCR degradations are as follows:
| CoAID CER/WER | Without | Character degradation | Phantom degradation | Bleed | Blur | All |
|---|---|---|---|---|---|---|
| CoAID CER | 2.105 | 6.358 | 2.105 | 2.122 | 2.616 | 7.898 |
| CoAID WER | 2.494 | 20.230 | 2.496 | 2.580 | 3.726 | 20.230 |
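
The record does not say how CER (character error rate) and WER (word error rate) were computed. A common definition is the Levenshtein edit distance between the original and the degraded text, divided by the length of the original, counted in characters for CER and in whitespace-separated words for WER. The sketch below follows that definition and reports percentages. The function names, the percentage scaling, and the whitespace tokenisation are assumptions made for illustration, not the dataset author's actual procedure.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate in percent (assumed unit), relative to the reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate in percent (assumed unit), using whitespace tokenisation."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


if __name__ == "__main__":
    # Hypothetical example: "m" misread as "rn" and "h" misread as "b".
    original = "masks reduce the spread"
    degraded = "rnasks reduce tbe spread"
    print(f"CER: {cer(original, degraded):.3f}")
    print(f"WER: {wer(original, degraded):.3f}")
```

In this made-up example, three character edits out of 23 characters corrupt two of the four words, so the WER is much higher than the CER. Under this definition a single wrong character invalidates a whole word, which would be consistent with the WER values in the table exceeding the CER values.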
Created: 2022-06-10



