OCR fulltexts of the Digital Collections of the Berlin State Library (DC-SBB)

NIAID Data Ecosystem, indexed 2026-03-11

Download link:

https://zenodo.org/record/3257040


Description:

The digital collections of the SBB contain 153,942 digitized works from the time period of 1470 to 1945. At the time of publication, 28,909 works have been OCR-processed, resulting in 4,988,099 full-text pages. For each page with OCR text, the language has been determined by langid (Lui/Baldwin 2012).

Files:

corpus-entropy.pkl: entropy rate per document page
corpus-language.pkl: language per document page
corpus.zip: fulltext corpus (extracts to .txt format)
de_corpus.zip: German sub-corpus (extracts to .txt format)
selection_de.pkl: selection list of German documents
xml2csv_alto.csv: fulltext corpus per document page (incl. OCR word confidences)

Sources:

Marco Lui and Timothy Baldwin. 2012. Langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, ACL ’12, pages 25–30, Stroudsburg, PA, USA. Association for Computational Linguistics.
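
The listing says corpus.zip and de_corpus.zip extract to .txt files, so one way to read the pages without unpacking the whole archive is to stream the members directly. The following is a minimal sketch using only the Python standard library. The helper name iter_text_pages, the UTF-8 encoding, and the assumption that every page is stored as a member ending in .txt are illustrative guesses and are not confirmed by the dataset description.

```python
import io
import zipfile
from typing import BinaryIO, Iterator, Tuple, Union


def iter_text_pages(
    archive: Union[str, BinaryIO], encoding: str = "utf-8"
) -> Iterator[Tuple[str, str]]:
    """Yield (member_name, text) for every .txt member of a zip archive.

    Hypothetical helper: the internal layout of corpus.zip is assumed,
    not documented in the listing.
    """
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            # Skip directories and anything that is not a plain-text file.
            if info.is_dir() or not info.filename.lower().endswith(".txt"):
                continue
            with zf.open(info) as fh:
                # Undecodable bytes are replaced rather than raising,
                # since OCR output may contain encoding noise.
                yield info.filename, fh.read().decode(encoding, errors="replace")
```

Calling iter_text_pages("corpus.zip") on a local copy of the archive would then yield one (name, text) pair per stored text file, which keeps memory use bounded by the size of a single file rather than the full corpus.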

Created:

2020-01-24
