
Event Registry titles dataset texts with OCR degradations

Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/6630827
Description:
This is the text of the Event Registry titles:

Rupnik, Jan, Andrej Muhic, Gregor Leban, Primoz Skraba, Blaz Fortuna, and Marko Grobelnik. 2016. "News Across Languages - Cross-Lingual Document Similarity and Event Tracking". Journal of Artificial Intelligence Research 55 (January): 283-316. https://doi.org/10.1613/jair.4780.

Miranda, Sebastião, Artūrs Znotiņš, Shay B. Cohen, and Guntis Barzdins. 2018. "Multilingual Clustering of Streaming News". In 2018 Conference on Empirical Methods in Natural Language Processing, 4535-44. Brussels, Belgium: Association for Computational Linguistics. https://www.aclweb.org/anthology/D18-1483/.

Guillaume Bernard. (2022). Event Registry titles only dataset with multiple extracted features (both sparse and dense) (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6630447

Some degradations are applied using the DocCreator [1] tool in order to degrade the text of the tweets and to reproduce some common errors found in OCRised documents [2].

[1]: Journet, Nicholas, Muriel Visani, Boris Mansencal, Kieu Van-Cuong, and Antoine Billy. 2017. "DocCreator: A New Software for Creating Synthetic Ground-Truthed Document Images". Journal of Imaging 3 (4): 62. https://doi.org/10.3390/jimaging3040062.

[2]: Linhares Pontes, Elvys, Ahmed Hamdi, Nicolas Sidere, and Antoine Doucet. 2019. "Impact of OCR Quality on Named Entity Linking". In Digital Libraries at the Crossroads of Digital Information for the Future, 11853:102-15. Lecture Notes in Computer Science. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-34058-2_11.

The results of the OCR degradations are as follows:

FibVid CER/WER               Without  Character degradation  Phantom degradation  Bleed   Blur    All
Event Registry Titles CER    2.421    6.940                  2.414                2.422   2.874   7.178
Event Registry Titles WER    1.127    19.785                 1.124                1.131   2.035   19.894
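The description does not say how the CER and WER figures were computed. As a rough illustration only, the sketch below assumes the usual definitions: the Levenshtein edit distance between the clean and the degraded text, divided by the length of the clean text, counted in characters for CER and in whitespace-separated words for WER, and expressed as a percentage. The function names (`edit_distance`, `cer`, `wer`) and the sample title are hypothetical and are not taken from the dataset or from DocCreator.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate in percent (assumed definition)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate in percent, on whitespace tokens (assumed definition)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical clean title and an OCR-like degraded version of it.
clean = "News across languages"
noisy = "Nevvs across languages"
print(f"CER = {cer(clean, noisy):.2f}, WER = {wer(clean, noisy):.2f}")
```

Under these assumptions, a single corrupted character makes the whole word count as an error. That is one reason WER can rise much faster than CER when character-level noise is applied.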
Created:
2022-06-10