Event Registry titles dataset texts with OCR degradations
收藏NIAID Data Ecosystem2026-03-13 收录
下载链接:
https://zenodo.org/record/6630827
下载链接
链接失效反馈官方服务:
资源简介:
This is the text of the Event Registry titles:
Rupnik, Jan, Andrej Muhic, Gregor Leban, Primoz Skraba, Blaz Fortuna, et Marko Grobelnik. 2016. « News Across Languages - Cross-Lingual Document Similarity and Event Tracking ». Journal of Artificial Intelligence Research 55 (janvier): 283‑316. https://doi.org/10.1613/jair.4780.
Miranda, Sebastião, Artūrs Znotiņš, Shay B. Cohen, et Guntis Barzdins. 2018. « Multilingual Clustering of Streaming News ». In 2018 Conference on Empirical Methods in Natural Language Processing, 4535‑44. Brussels, Belgium: Association for Computational Linguistics. https://www.aclweb.org/anthology/D18-1483/.
Guillaume Bernard. (2022). Event Registry titles only dataset with multiple extracted features (both sparse and dense) (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6630447
Some degradations are applied using the DocCreator [1] tool in order to degrade the text of the tweets and to reproduce some common errors found in OCRised documents [2].
[1]: Journet, Nicholas, Muriel Visani, Boris Mansencal, Kieu Van-Cuong, et Antoine Billy. 2017. « DocCreator: A New Software for Creating Synthetic Ground-Truthed Document Images ». Journal of Imaging 3 (4): 62. https://doi.org/10.3390/jimaging3040062.
[2]: Linhares Pontes, Elvys, Ahmed Hamdi, Nicolas Sidere, et Antoine Doucet. 2019. « Impact of OCR Quality on Named Entity Linking ». In Digital Libraries at the Crossroads of Digital Information for the Future, 11853:102‑15. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-34058-2_11.
The results of the OCR degradations are as follow:
FibVid CER/WER
Without
Character degradation
Phantom degradation
Bleed
Blur
All
Event Registry Titles
CER
2.421
6.940
2.414
2.422
2.874
7.178
Event Registry Titles
WER
1.127
19.785
1.124
1.131
2.035
19.894
创建时间:
2022-06-10



