
OCR fulltexts of the Digital Collections of the Berlin State Library (DC-SBB)

Source: NIAID Data Ecosystem (indexed 2026-03-11)
Download link:
https://zenodo.org/record/3257040
Description:
The digital collections of the SBB contain 153,942 digitized works from the period 1470 to 1945. At the time of publication, 28,909 works had been OCR-processed, resulting in 4,988,099 full-text pages. For each page with OCR text, the language has been determined by langid (Lui/Baldwin 2012).

Files:
corpus-entropy.pkl     entropy rate per document page
corpus-language.pkl    language per document page
corpus.zip             fulltext corpus (extracts to .txt format)
de_corpus.zip          German sub-corpus (extracts to .txt format)
selection_de.pkl       selection list of German documents
xml2csv_alto.csv       fulltext corpus per document page (incl. OCR word confidences)

Sources:
Marco Lui and Timothy Baldwin. 2012. Langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, ACL '12, pages 25–30, Stroudsburg, PA, USA. Association for Computational Linguistics.
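Because corpus.zip and de_corpus.zip are described as extracting to .txt files, the pages can in principle be read straight from the archive without unpacking it. The following is a minimal sketch using only the Python standard library. The internal layout of the archives and their text encoding are not stated in the description, so the UTF-8 default, the helper names `iter_text_files` and `build_demo_archive`, and the member names in the demo archive are all assumptions made for illustration.

```python
import io
import zipfile
from typing import BinaryIO, Iterator, Tuple, Union


def iter_text_files(
    archive: Union[str, BinaryIO], encoding: str = "utf-8"
) -> Iterator[Tuple[str, str]]:
    """Yield (member name, decoded text) for every .txt member of a zip archive.

    The encoding is an assumption; undecodable bytes are replaced, not raised.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            # Skip directory entries and anything that is not a .txt file.
            if info.is_dir() or not info.filename.lower().endswith(".txt"):
                continue
            with zf.open(info) as fh:
                yield info.filename, fh.read().decode(encoding, errors="replace")


def build_demo_archive() -> io.BytesIO:
    """Build a tiny in-memory archive with hypothetical member names."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("doc1/0001.txt", "Königliche Bibliothek zu Berlin")
        zf.writestr("doc1/0002.txt", "Zweite Seite")
        zf.writestr("doc1/meta.xml", "<meta/>")  # ignored: not a .txt member
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Replace the demo archive with a path such as "corpus.zip" after download.
    for name, text in iter_text_files(build_demo_archive()):
        print(name, len(text))
```

Streaming members one at a time keeps memory use bounded, which matters for a corpus of several million pages; whether one .txt member corresponds to a page or to a whole document should be checked against the actual archive.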
Created:
2020-01-24