FibVid dataset texts with OCR degradations
Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/6630757
Description:
This dataset contains the text of the FibVid dataset, which was originally built for fake news detection and has been adapted here for use in event detection.
Kim, Jisu, Jihwan Aum, SangEun Lee, Yeonju Jang, Eunil Park, and Daejin Choi. 2021. "FibVID: Comprehensive Fake News Diffusion Dataset during the COVID-19 Period". Telematics and Informatics 64 (November): 101688. https://doi.org/10.1016/j.tele.2021.101688.
Guillaume Bernard. (2022). Fibvid dataset with multiple extracted features (both sparse and dense) (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6630409
Degradations are applied to the text of the tweets with the DocCreator tool [1] in order to reproduce common errors found in OCRed documents [2].
[1]: Journet, Nicholas, Muriel Visani, Boris Mansencal, Kieu Van-Cuong, and Antoine Billy. 2017. "DocCreator: A New Software for Creating Synthetic Ground-Truthed Document Images". Journal of Imaging 3 (4): 62. https://doi.org/10.3390/jimaging3040062.
[2]: Linhares Pontes, Elvys, Ahmed Hamdi, Nicolas Sidere, and Antoine Doucet. 2019. "Impact of OCR Quality on Named Entity Linking". In Digital Libraries at the Crossroads of Digital Information for the Future, 11853:102-15. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-34058-2_11.
The results of the OCR degradations are as follows:
FibVid CER/WER

Dataset  Metric  Without  Character degradation  Phantom degradation  Bleed  Blur   All
FibVid   CER     1.463    6.089                  1.461                1.467  1.935  6.359
FibVid   WER     2.065    20.797                 2.041                2.052  2.868  21.396
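The record does not say how the CER (character error rate) and WER (word error rate) figures above were computed. As a rough illustration only, the sketch below uses the common definition: the Levenshtein edit distance between the clean text and the degraded text, divided by the length of the clean text. The function names, the whitespace tokenisation, and the choice to express the result as a percentage are all assumptions and do not come from the dataset authors.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate as a percentage (assumed unit)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate as a percentage, splitting on whitespace (assumed)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical example: a clean tweet and an OCR-like degraded copy.
clean = "fake news spreads fast"
noisy = "fake nevvs spreads fast"
print(f"CER = {cer(clean, noisy):.3f}, WER = {wer(clean, noisy):.3f}")
```

Under this definition a single corrupted character makes the whole word wrong, so WER grows faster than CER. The table shows the same pattern for character degradation.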
Created: 2022-06-10



