swswswswsw/rukopys

Name: swswswswsw/rukopys
Creator: swswswswsw
Published: 2026-04-17 13:41:42
License: No description provided

Hugging Face · Updated 2026-04-17 · Indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/swswswswsw/rukopys

Description:

---
language:
- uk
license: cc-by-nc-sa-4.0
task_categories:
- object-detection
- image-to-text
tags:
- handwriting-recognition
- htr
- ocr
- bounding-box
- ukrainian
- document-analysis
- cyrillic
size_categories:
- 10K<n<100K
pretty_name: "RUKOPYS: Ukrainian Handwritten Text Recognition Dataset"
authors:
- Dmytro Voitekh
- Volodymyr Zmiivskyyi
- Oleksii Molchanovskyi
organizations:
- Ukrainian Catholic University
configs:
- config_name: full
  default: true
  data_files:
  - split: train
    path:
    - "train/metadata.jsonl"
    - "train/images/**"
  - split: silver
    path:
    - "silver/metadata.jsonl"
    - "silver/images/**"
- config_name: gt_only
  data_files:
  - split: train
    path:
    - "train/metadata.jsonl"
    - "train/images/**"
- config_name: test
  data_files:
  - split: test
    path:
    - "test/metadata.jsonl"
    - "test/images/**"
---

# RUKOPYS: Ukrainian Handwritten Text Recognition Dataset

**RUKOPYS** (Ukrainian: *рукопис*, "manuscript") is the first large-scale open dataset for Ukrainian handwritten text recognition (HTR). It spans over a century of Ukrainian handwriting, from 1920s archival documents to present-day school homework, and is designed for end-to-end document understanding: region detection, type classification, and text transcription.

Ukrainian is among the largest Slavic languages (45M+ native speakers) yet had no dedicated open HTR dataset prior to RUKOPYS.

> **Competition:** RUKOPYS powers the [Handwritten to Data](https://www.kaggle.com/competitions/handwritten-to-data) challenge on Kaggle (April 16 to June 15, 2026). Submit your HTR model predictions and compete for $7,000 in prizes.

---

## What Makes RUKOPYS Different

Most HTR datasets are built from a single source: one archive, one corpus, one handwriting style. RUKOPYS is deliberately the opposite.
It combines four sources that differ across every dimension that makes handwriting recognition hard:

| Dimension | Range in RUKOPYS |
|-----------|-----------------|
| **Time period** | 1919–1935 (archival pen & ink) → 2020–2025 (modern ballpoint, pencil) |
| **Writers** | School children (grades 5–11), university students, adult citizens |
| **Document type** | Archival state documents, personal dictation sheets, exam papers, homework |
| **Capture method** | Flatbed scanner (archive, university) vs phone camera (dictation, school) |
| **Orthography** | Archaic pre-reform spelling (1920s) → contemporary Ukrainian |
| **Content** | Prose, formulas, chemistry, tables, teacher annotations |

This breadth is intentional. A model trained only on clean archival scans will fail on a phone photo of a student notebook, and vice versa. RUKOPYS is designed so that the models trained on it generalize across real-world variation, not just perform well on a narrow slice of it.

---

## Splits

| Split | Images | GT Regions | `annotation_source` | Description |
|-------|--------|-----------|---------------------|-------------|
| **train** | 770 | 16,381 | `annotator` / `volunteer` | Human-annotated: full bboxes + verified transcription |
| **silver** | 8,210 | 163,081 | `auto` | Auto-annotated by Qwen3-VL 8B + Gemini, for self-training |
| **test** | 386 | hidden | n/a | Images only; submit predictions to the [Kaggle competition](https://www.kaggle.com/competitions/handwritten-to-data) |
| **private benchmark** | 21 | hidden until June 15 | n/a | Held-out set withheld during the competition; published after the online stage closes as a reusable community benchmark |

Use `annotation_source` to distinguish human GT from auto-annotations when combining splits.

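As a minimal sketch of separating human GT from auto-annotations, the snippet below partitions metadata records by `annotation_source`. It assumes records are plain dicts shaped like rows of `metadata.jsonl`; the helper name `partition_by_source` and the sample file names are hypothetical and not part of the dataset tooling.

```python
# Values the splits table lists for human-produced labels.
HUMAN_SOURCES = {"annotator", "volunteer"}


def partition_by_source(records):
    """Return (human, auto) lists based on each record's annotation_source."""
    human, auto = [], []
    for record in records:
        if record.get("annotation_source") in HUMAN_SOURCES:
            human.append(record)
        else:
            # "auto" (silver) and any unexpected or missing value land here.
            auto.append(record)
    return human, auto


# Hypothetical records standing in for rows of a combined train + silver set.
records = [
    {"file_name": "images/a.jpg", "annotation_source": "annotator"},
    {"file_name": "images/b.jpg", "annotation_source": "volunteer"},
    {"file_name": "images/c.jpg", "annotation_source": "auto"},
]
human, auto = partition_by_source(records)
print(len(human), len(auto))
```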

---

## Data Sources

| Source | ID | Period | Images (train+test) | Description |
|--------|----|--------|---------------------|-------------|
| National Dictation | `dictation` | 2020–2025 | 456 | Phone photos of handwritten Ukrainian National Dictation. One canonical text per year, thousands of unique handwriting styles. |
| State Archive | `archive` | 1919–1935 | 169 | Scanned documents from 12 archival funds of the Central State Archive of Ukraine (ЦДАВО). Pen & ink, archaic orthography. |
| University (KNUTE) | `university` | 2024–2025 | 246 | Scanned student exam work from 5 faculties: text, math formulas, chemistry, tables. |
| School Homework | `school` | 2024–2025 | 285 | Phone photos of school homework (grades 5–11, 20+ subjects) from Opornyi Lyceum s. Zymne (Опорний ліцей с. Зимне). |

---

## Dataset Structure

```
train/                    # Human-annotated (770 images)
  images/{uuid}.jpg
  metadata.jsonl          # bbox + type + language + legibility + text
silver/                   # Auto-annotated (8,210 images)
  images/{uuid}.jpg
  metadata.jsonl          # same schema as train
test/                     # Test images, no annotations (386 images)
  images/{uuid}.jpg
  metadata.jsonl          # file_name, image_width, image_height, source (regions: null)
```

`train` and `silver` share the same schema and can be combined freely with `concatenate_datasets`.

---

## Loading

### With `datasets` (recommended: loads images as PIL, regions as structured fields)

```python
from datasets import load_dataset, concatenate_datasets

ds = load_dataset("UkrainianCatholicUniversity/rukopys")

# Human-annotated train
gt_train = ds["train"]
example = gt_train[0]
print(example["image"])              # PIL Image
print(example["source"])             # "dictation"
print(example["annotation_source"])  # "annotator"
print(example["regions"])            # [{bbox, type, language, legibility, text}, ...]

# Combine GT + silver
full_train = concatenate_datasets([gt_train, ds["silver"]])

# GT-only config (no silver):
ds_gt = load_dataset("UkrainianCatholicUniversity/rukopys", "gt_only")
```

### With `pandas`

```python
import pandas as pd

df_train = pd.read_json("hf://datasets/UkrainianCatholicUniversity/rukopys/train/metadata.jsonl", lines=True)
```

### With `polars`

```python
import polars as pl

df_train = pl.read_ndjson("hf://datasets/UkrainianCatholicUniversity/rukopys/train/metadata.jsonl")
```

### Direct download with `huggingface_hub`

```python
from huggingface_hub import snapshot_download

path = snapshot_download(repo_id="UkrainianCatholicUniversity/rukopys", repo_type="dataset")
# All files under `path` in the original folder structure (train/, silver/, test/)
```

---

## Annotation Schema

Each record in `train` and `silver` has a `regions` field, a list of annotated content regions:

```json
{
  "file_name": "images/abc123.jpg",
  "image_width": 3024,
  "image_height": 4032,
  "source": "dictation",
  "annotation_source": "annotator",
  "regions": [
    {
      "bbox": [134, 766, 3754, 1197],
      "type": "handwritten",
      "language": "uk",
      "legibility": "legible",
      "text": "Спочатку був брехунець. У нього кожного дня: „Клац!\""
    }
  ]
}
```

`bbox` format: `[x1, y1, x2, y2]`, in pixel coordinates with a top-left origin.

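A minimal sketch of turning a `bbox` into a crop box. In the sample record above, `x2` (3754) is larger than `image_width` (3024), so this sketch clamps coordinates to the image bounds before use. Whether clamping is the right policy for such boxes is an assumption, and `clamp_bbox` is a hypothetical helper name rather than part of the dataset tooling.

```python
def clamp_bbox(bbox, image_width, image_height):
    """Clamp an [x1, y1, x2, y2] box to the image and return it as a tuple."""
    x1, y1, x2, y2 = bbox
    x1 = max(0, min(x1, image_width))
    x2 = max(0, min(x2, image_width))
    y1 = max(0, min(y1, image_height))
    y2 = max(0, min(y2, image_height))
    if x2 <= x1 or y2 <= y1:
        # Nothing of the box lies inside the image.
        raise ValueError(f"empty box after clamping: {bbox}")
    return (x1, y1, x2, y2)


# Record trimmed down from the schema example; only the fields used here.
record = {
    "image_width": 3024,
    "image_height": 4032,
    "regions": [{"bbox": [134, 766, 3754, 1197], "type": "handwritten"}],
}
boxes = [
    clamp_bbox(region["bbox"], record["image_width"], record["image_height"])
    for region in record["regions"]
]
print(boxes)
```

The returned tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so a clamped box could be passed to `example["image"].crop(box)` to extract a single line image.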
### Region Types

| Type | Description | Transcription |
|------|-------------|---------------|
| `handwritten` | Handwritten text line | Exact text, 1 bbox = 1 line |
| `printed` | Printed/typed text line | Exact text, 1 bbox = 1 line |
| `formula` | Standalone math/chemistry expression | LaTeX |
| `table` | Full table | Pipe-separated values |
| `annotation` | Teacher marks, grades, numbering | Short text |
| `image` | Stamps, seals, drawings | Empty |
| `graph` | Charts, plots | Empty |

### Special Text Markers

| Marker | Meaning |
|--------|---------|
| `~~word~~` | Strikethrough text |
| `~~old~~{new}` | Strikethrough with correction |
| `[illegible]` | Unreadable word within a legible line |

### Region Attributes

| Attribute | Values |
|-----------|--------|
| `language` | `uk`, `other` |
| `legibility` | `legible`, `illegible` |
| `annotation_source` | `annotator`, `volunteer`, `auto` |

`annotation_source` values:

| Value | Meaning |
|-------|---------|
| `annotator` | Labeled by [Keymakr](https://keymakr.com/), a professional human annotation service |
| `volunteer` | Labeled by community volunteers; spot-checked for quality |
| `auto` | Auto-generated by the VLM pipeline (silver split only) |

---

## Anti-Leakage Design

| Source | Train | Test | Guarantee |
|--------|-------|------|-----------|
| **Dictation** | Year 2024 | Years 2020, 2022, 2025 | Different canonical texts |
| **Archive** | Archival file set A | Archival file set B | Non-overlapping archival document sets |
| **University** | Exam PDF group A | Exam PDF group B | Different students' exam files |
| **School** | Grades 5, 6, 7, 9, 11 | Grades 8, 10 | Different grade bands |

---

## Silver Split

The `silver` split contains 8,210 auto-annotated images generated by a multi-stage VLM pipeline:

```
Stage 1: Qwen3-VL 8B     block detection
Stage 2: Gemini Flash    block classification
Stage 3: Qwen3-VL 8B     line segmentation within text blocks
Stage 4: Gemini Flash    transcription
```

Known limitations: bbox sequence drift on dense text; axis-aligned boxes may clip skewed lines; ~440 archive files contain mixed Ukrainian/Russian text from the 1919–1935 period.

---

## Acknowledgements

Professional annotation was provided by [Keymakr](https://keymakr.com/), a human-in-the-loop data annotation company.

Additional annotations were contributed by volunteers. The full list of contributors will be published shortly. All volunteer annotations underwent spot-checking for quality assurance.

All images were reviewed prior to publication to remove personally identifiable information (PII).

---

## Roadmap

This is the first public release of RUKOPYS. The dataset will grow incrementally, both through additional sources and through expanded coverage of existing ones.

We welcome collaboration from:

- **Annotators** interested in contributing human-verified labels
- **Researchers** working on better automatic annotation approaches (layout analysis, HTR pre-annotation, active learning)

If you'd like to contribute, reach out via the [Kaggle competition forum](https://www.kaggle.com/competitions/handwritten-to-data/discussion) or open an issue on HuggingFace.

---

## Potential Uses

- Fine-tune HTR models on `train`, evaluate on `test` via the [Kaggle competition](https://www.kaggle.com/competitions/handwritten-to-data)
- Pseudo-labeling: GT text for each dictation year is publicly known; use it for text-line alignment
- Self-training / semi-supervised learning with the `silver` split
- Multi-source domain adaptation (modern handwriting → historical documents)

---

## License

**CC BY-NC-SA 4.0**: Attribution, Non-Commercial, Share-Alike.

- **National Dictation** images: provided under a data sharing agreement for academic research and publication
- **State Archive** (ЦДАВО): provided under a data sharing agreement for academic research and publication
- **KNUTE** and **Opornyi Lyceum s. Zymne (Опорний ліцей с. Зимне)**: provided under data sharing agreements for academic research and publication

---

## Citation

```bibtex
@dataset{rukopys_2026,
  title        = {{RUKOPYS}: Ukrainian Handwritten Text Recognition Dataset},
  author       = {Dmytro Voitekh and Volodymyr Zmiivskyyi and Oleksii Molchanovskyi},
  organization = {Ukrainian Catholic University},
  year         = {2026},
  license      = {CC BY-NC-SA 4.0},
  url          = {https://huggingface.co/UkrainianCatholicUniversity/rukopys},
  note         = {First large-scale Ukrainian HTR dataset; from 1920s archival documents to 2025 school homework and exams}
}
```
