OCR fulltexts of the Digital Collections of the Berlin State Library (DC-SBB)
Source: Zenodo · Updated: 2020-07-29 · Indexed: 2026-05-25
Download link:
https://zenodo.org/record/3257040
Description:
The digital collections of the SBB contain 153,942 digitized works from the period 1470 to 1945.
At the time of publication, 28,909 works had been OCR-processed, resulting in 4,988,099 full-text pages.
For each page with OCR text, the language was determined with langid (Lui/Baldwin 2012).
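As a rough illustration of per-page language tagging, the sketch below maps page identifiers to language codes. It assumes the langid package is installed and relies on langid.classify(text) returning a (language code, score) tuple; the function names tag_languages and language_counts and the page identifiers are hypothetical and are not taken from the dataset. It is not necessarily how the published language file was produced.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Optional, Tuple

# A classifier takes a text and returns (language_code, score).
Classifier = Callable[[str], Tuple[str, float]]


def tag_languages(
    pages: Iterable[Tuple[str, str]],
    classify: Optional[Classifier] = None,
) -> Dict[str, str]:
    """Map each page id to the language code predicted for its OCR text."""
    if classify is None:
        # Imported lazily so that another classifier can be passed in instead.
        import langid
        classify = langid.classify
    return {page_id: classify(text)[0] for page_id, text in pages}


def language_counts(tags: Dict[str, str]) -> Counter:
    """Count how many pages were assigned to each language code."""
    return Counter(tags.values())
```

A call such as tag_languages([("page_0001", "Dies ist ein kurzer deutscher Satz.")]) would return a dictionary with one entry per page; the page id here is made up for the example. Passing the classifier as a parameter keeps the sketch usable with any other language identifier.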
corpus-entropy.pkl: entropy rate per document page
corpus-language.pkl: language per document page
corpus.zip: full-text corpus (extracts to .txt format)
de_corpus.zip: German sub-corpus (extracts to .txt format)
selection_de.pkl: selection list of German documents
xml2csv_alto.csv: full-text corpus per document page (incl. OCR word confidences)
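The listing does not say how the per-page entropy rate in corpus-entropy.pkl was estimated. As a hedged illustration of the general idea only, the sketch below computes the Shannon entropy of a page's character distribution in bits per character. This is a simple unigram estimate, not a true entropy rate, and the function name char_entropy is hypothetical; the published values may come from a different estimator.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character.

    Assumption: a unigram (single-character) model; the dataset's own
    entropy-rate estimator is not documented in this listing.
    """
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A page consisting of one repeated character scores 0 bits, while text spread evenly over many distinct characters scores higher. A per-page score of this kind could be used to flag pages whose OCR output looks unlike ordinary running text, although any threshold would have to be chosen empirically.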
Sources
Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, ACL ’12, pages 25–30, Stroudsburg, PA, USA. Association for Computational Linguistics.
Provider:
Zenodo
Created:
2019-06-26



