CZLC/benczechmark_histcorpus

Name: CZLC/benczechmark_histcorpus
Creator: CZLC
Published: 2024-08-22 09:08:36
License: No description provided

Source: Hugging Face. Updated 2024-08-22; indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/CZLC/benczechmark_histcorpus


Description:

---
language:
- cs
---

## Introduction

This is a validation set split off from the historical dataset included in the [BUT-LCC](https://huggingface.co/datasets/BUT-FIT/BUT-LCC) corpus. To avoid direct contamination from BUT-LCC, this set is filtered against the historical dataset from BUT-LCC by our fuzzy deduplication pipeline.

## Legal Information & Data Origin

This dataset consists of OCR'd documents from 1850 onward, publicly available from the [Czech Digital Library](https://www.digitalniknihovna.cz/). We use [PeroOCR](https://pero-ocr.fit.vutbr.cz/) for optical character recognition (OCR). CZLC members do not own the distributed documents.

## Authors & Contact

- Karel Beneš & Martin Kišš (collection and OCR)
- Jan Doležal (fuzzy deduplication)
- Martin Fajčík (data split off, task management)
- Michal Hradiš (PERO Team Lead)

Correspondence to `martin.fajcik@vut.cz`
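
The introduction says the validation set is filtered against the BUT-LCC historical data by a fuzzy deduplication pipeline, but the pipeline itself is not described here. The following is only an illustrative sketch of one common way to do such filtering, using Jaccard similarity over character n-grams. The function names (`char_ngrams`, `jaccard`, `filter_against`), the n-gram size, and the threshold are all assumptions made for this example and are not taken from the CZLC pipeline.

```python
# Illustrative sketch only: NOT the CZLC deduplication pipeline.
# All names, the n-gram size, and the threshold are hypothetical choices.


def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Return the set of character n-grams of a lowercased,
    whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Too short for a full n-gram: treat the whole text as one unit.
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_against(candidates: list[str], reference: list[str],
                   threshold: float = 0.8) -> list[str]:
    """Keep only candidates whose similarity to every reference
    document stays below the threshold."""
    reference_grams = [char_ngrams(doc) for doc in reference]
    kept = []
    for doc in candidates:
        grams = char_ngrams(doc)
        if all(jaccard(grams, ref) < threshold for ref in reference_grams):
            kept.append(doc)
    return kept


if __name__ == "__main__":
    reference = ["Praha je hlavní město České republiky."]
    candidates = [
        "PRAHA  je hlavní město   České republiky.",  # same text, different case/spacing
        "Brno leží na jižní Moravě.",
    ]
    print(filter_against(candidates, reference))
```

This pairwise comparison is quadratic in the number of documents, so it only suits small collections; large corpora typically rely on approximate methods such as MinHash with locality-sensitive hashing. Whether the CZLC pipeline does so is not stated in this description.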

Provided by:

CZLC
