CZLC/benczechmark_histcorpus
Source: Hugging Face. Updated 2024-08-22; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/CZLC/benczechmark_histcorpus
Resource description:
---
language:
- cs
---
## Introduction
This is a validation set split off from the historical dataset included in the [BUT-LCC](https://huggingface.co/datasets/BUT-FIT/BUT-LCC) corpus.
Furthermore, to avoid direct contamination from BUT-LCC, this set was filtered against the historical dataset in BUT-LCC using our fuzzy deduplication pipeline.
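
The fuzzy deduplication pipeline itself is not described in this card. As a rough illustration of the general idea only, the sketch below drops validation documents that are too similar to any reference document, using Jaccard similarity over character n-grams. The function names, the n-gram size, and the threshold are all assumptions made for this example and do not reflect the actual pipeline.

```python
def shingles(text, n=5):
    """Return the set of character n-grams of a lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Very short texts are represented by a single shingle (or none if empty).
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_near_duplicates(candidates, reference, threshold=0.8, n=5):
    """Keep only candidates whose similarity to every reference document is below threshold.

    Hypothetical helper: this is a brute-force comparison, not the authors' pipeline.
    """
    reference_shingles = [shingles(doc, n) for doc in reference]
    kept = []
    for doc in candidates:
        doc_shingles = shingles(doc, n)
        if all(jaccard(doc_shingles, ref) < threshold for ref in reference_shingles):
            kept.append(doc)
    return kept
```

This brute-force comparison costs time proportional to the product of the two collection sizes, so it only suits small samples. Deduplication at corpus scale typically relies on an approximate scheme such as MinHash with locality-sensitive hashing.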
## Legal Information & Data Origin
This dataset consists of OCR'd documents dating from 1850 onward, publicly available from the [Czech Digital Library](https://www.digitalniknihovna.cz/). We use [PeroOCR](https://pero-ocr.fit.vutbr.cz/) for optical character recognition (OCR). CZLC members do not own the distributed documents.
## Authors & Contact
- Karel Beneš & Martin Kišš (collection and OCR)
- Jan Doležal (fuzzy deduplication)
- Martin Fajčík (data split off, task management)
- Michal Hradiš (PERO Team Lead)

Correspondence to `martin.fajcik@vut.cz`
Provided by:
CZLC



