
CZLC/benczechmark_histcorpus

Source: Hugging Face. Updated 2024-08-22; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/CZLC/benczechmark_histcorpus
Resource description:
---
language:
- cs
---

## Introduction

This is a validation set split off from the historical dataset included in the [BUT-LCC](https://huggingface.co/datasets/BUT-FIT/BUT-LCC) corpus. Furthermore, to avoid direct contamination from BUT-LCC, this set is filtered against the historical dataset from BUT-LCC by our fuzzy deduplication pipeline.

## Legal Information & Data Origin

This dataset consists of OCR'd documents since 1850, publicly available from the [Czech Digital Library](https://www.digitalniknihovna.cz/). We use [PeroOCR](https://pero-ocr.fit.vutbr.cz/) for optical character recognition (OCR). CZLC members do not own the distributed documents.

## Authors & Contact

- Karel Beneš & Martin Kišš (collection and OCR)
- Jan Doležal (fuzzy deduplication)
- Martin Fajčík (data split off, task management)
- Michal Hradiš (PERO Team Lead)

Correspondence to `martin.fajcik@vut.cz`
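
The Introduction says this set was filtered against the BUT-LCC historical data by a fuzzy deduplication pipeline, but the card does not describe that pipeline. The following is only a minimal sketch of the general idea, assuming character 5-gram shingles and a Jaccard similarity threshold of 0.8. The function names (`shingles`, `jaccard`, `filter_against`), the shingle size, and the threshold are all hypothetical choices for illustration, not the authors' method.

```python
from __future__ import annotations


def shingles(text: str, n: int = 5) -> set[str]:
    """Character n-grams of the lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= n:
        return {normalized}
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_against(
    candidates: list[str],
    reference: list[str],
    threshold: float = 0.8,  # assumed cutoff, not taken from the dataset card
    n: int = 5,              # assumed shingle size, not taken from the dataset card
) -> list[str]:
    """Keep only candidates whose similarity to every reference doc is below threshold."""
    reference_shingles = [shingles(doc, n) for doc in reference]
    kept = []
    for doc in candidates:
        doc_shingles = shingles(doc, n)
        if all(jaccard(doc_shingles, ref) < threshold for ref in reference_shingles):
            kept.append(doc)
    return kept


if __name__ == "__main__":
    # Toy strings invented for this example; they are not taken from the dataset.
    reference = ["Národní listy, v Praze dne 3. ledna 1861. Zprávy domácí a zahraničné."]
    candidates = [
        # Near-duplicate: differs only in the final character, as OCR noise might.
        "Národní listy, v Praze dne 3. ledna 1861. Zprávy domácí a zahraničné,",
        "Úroda chmele byla letos na Žatecku velmi dobrá.",
    ]
    print(filter_against(candidates, reference))
```

This all-pairs comparison is quadratic in the number of documents, so it only suits small samples. At corpus scale, an approximate scheme such as MinHash with locality-sensitive hashing is the usual substitute.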
Provided by:
CZLC