swswswswsw/rukopys
Source: Hugging Face. Updated 2026-04-17; indexed 2026-04-26.
Download link: https://hf-mirror.com/datasets/swswswswsw/rukopys
Dataset card:
---
language:
- uk
license: cc-by-nc-sa-4.0
task_categories:
- object-detection
- image-to-text
tags:
- handwriting-recognition
- htr
- ocr
- bounding-box
- ukrainian
- document-analysis
- cyrillic
size_categories:
- 10K<n<100K
pretty_name: "RUKOPYS: Ukrainian Handwritten Text Recognition Dataset"
authors:
- Dmytro Voitekh
- Volodymyr Zmiivskyyi
- Oleksii Molchanovskyi
organizations:
- Ukrainian Catholic University
configs:
- config_name: full
  default: true
  data_files:
  - split: train
    path:
    - "train/metadata.jsonl"
    - "train/images/**"
  - split: silver
    path:
    - "silver/metadata.jsonl"
    - "silver/images/**"
- config_name: gt_only
  data_files:
  - split: train
    path:
    - "train/metadata.jsonl"
    - "train/images/**"
- config_name: test
  data_files:
  - split: test
    path:
    - "test/metadata.jsonl"
    - "test/images/**"
---
# RUKOPYS: Ukrainian Handwritten Text Recognition Dataset
**RUKOPYS** (Ukrainian: *рукопис* — manuscript) is the first large-scale open dataset for Ukrainian handwritten text recognition (HTR). It spans over a century of Ukrainian handwriting — from 1920s archival documents to present-day school homework — and is designed for end-to-end document understanding: region detection, type classification, and text transcription.
Ukrainian is among the largest Slavic languages (45M+ native speakers) yet had no dedicated open HTR dataset prior to RUKOPYS.
> **Competition:** RUKOPYS powers the [Handwritten to Data](https://www.kaggle.com/competitions/handwritten-to-data) challenge on Kaggle (April 16 — June 15, 2026). Submit your HTR model predictions and compete for $7,000 in prizes.
---
## What Makes RUKOPYS Different
Most HTR datasets are built from a single source — one archive, one corpus, one handwriting style. RUKOPYS is deliberately the opposite.
It combines four sources that differ across every dimension that makes handwriting recognition hard:
| Dimension | Range in RUKOPYS |
|-----------|-----------------|
| **Time period** | 1919–1935 (archival pen & ink) → 2020–2025 (modern ballpoint, pencil) |
| **Writers** | School children (grades 5–11), university students, adult citizens |
| **Document type** | Archival state documents, personal dictation sheets, exam papers, homework |
| **Capture method** | Flatbed scanner (archive, university) vs phone camera (dictation, school) |
| **Orthography** | Archaic pre-reform spelling (1920s) → contemporary Ukrainian |
| **Content** | Prose, formulas, chemistry, tables, teacher annotations |
This breadth is intentional. A model trained only on clean archival scans will fail on a phone photo of a student notebook — and vice versa. RUKOPYS is designed so that the models trained on it generalize across real-world variation, not just perform well on a narrow slice of it.
---
## Splits
| Split | Images | GT Regions | `annotation_source` | Description |
|-------|--------|-----------|---------------------|-------------|
| **train** | 770 | 16,381 | `annotator` / `volunteer` | Human-annotated — full bboxes + verified transcription |
| **silver** | 8,210 | 163,081 | `auto` | Auto-annotated by Qwen3-VL 8B + Gemini — for self-training |
| **test** | 386 | — (hidden) | — | Images only — submit predictions to the [Kaggle competition](https://www.kaggle.com/competitions/handwritten-to-data) |
| **private benchmark** | 21 | — (hidden until June 15) | — | Held-out set withheld during the competition; published after the online stage closes as a reusable community benchmark |
Use `annotation_source` to distinguish human GT from auto-annotations when combining splits.
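A minimal sketch of that separation, assuming records are plain dicts shaped like the metadata rows described in this card. The helper name `split_by_annotation_source` and the toy rows are hypothetical and not part of the dataset tooling.

```python
# Values that mark human ground truth, per the Splits table.
HUMAN_SOURCES = {"annotator", "volunteer"}


def split_by_annotation_source(records):
    """Return (human, other): records with human labels, and everything else."""
    human, other = [], []
    for record in records:
        if record.get("annotation_source") in HUMAN_SOURCES:
            human.append(record)
        else:
            other.append(record)  # "auto", or a missing/unknown value
    return human, other


# Toy rows standing in for combined train + silver metadata (invented file names).
records = [
    {"file_name": "images/a.jpg", "annotation_source": "annotator"},
    {"file_name": "images/b.jpg", "annotation_source": "auto"},
    {"file_name": "images/c.jpg", "annotation_source": "volunteer"},
]
human, other = split_by_annotation_source(records)
print(len(human), len(other))  # 2 1
```

With a `datasets.Dataset`, the same membership test can be passed to `Dataset.filter`, for example `ds.filter(lambda r: r["annotation_source"] in HUMAN_SOURCES)`.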
---
## Data Sources
| Source | ID | Period | Images (train+test) | Description |
|--------|----|--------|---------------------|-------------|
| National Dictation | `dictation` | 2020–2025 | 456 | Phone photos of handwritten Ukrainian National Dictation. One canonical text per year, thousands of unique handwriting styles. |
| State Archive | `archive` | 1919–1935 | 169 | Scanned documents from 12 archival funds of the Central State Archive of Ukraine (ЦДАВО). Pen & ink, archaic orthography. |
| University (KNUTE) | `university` | 2024–2025 | 246 | Scanned student exam work from 5 faculties: text, math formulas, chemistry, tables. |
| School Homework | `school` | 2024–2025 | 285 | Phone photos of school homework (grades 5–11, 20+ subjects) from Opornyi Lyceum s. Zymne (Опорний ліцей с. Зимне). |
---
## Dataset Structure
```
train/                  # Human-annotated (770 images)
  images/{uuid}.jpg
  metadata.jsonl        # bbox + type + language + legibility + text
silver/                 # Auto-annotated (8,210 images)
  images/{uuid}.jpg
  metadata.jsonl        # same schema as train
test/                   # Test images, no annotations (386 images)
  images/{uuid}.jpg
  metadata.jsonl        # file_name, image_width, image_height, source (regions: null)
```
`train` and `silver` share the same schema and can be combined freely with `concatenate_datasets`.
---
## Loading
### With `datasets` (recommended — loads images as PIL, regions as structured fields)
```python
from datasets import load_dataset, concatenate_datasets

# The default config ("full") exposes the train and silver splits
ds = load_dataset("UkrainianCatholicUniversity/rukopys")

# Human-annotated train
gt_train = ds["train"]
example = gt_train[0]
print(example["image"])              # PIL Image
print(example["source"])             # e.g. "dictation"
print(example["annotation_source"])  # e.g. "annotator"
print(example["regions"])            # [{bbox, type, language, legibility, text}, ...]

# Combine GT + silver
full_train = concatenate_datasets([gt_train, ds["silver"]])

# GT-only config (no silver):
ds_gt = load_dataset("UkrainianCatholicUniversity/rukopys", "gt_only")
```
### With `pandas`
```python
import pandas as pd
df_train = pd.read_json("hf://datasets/UkrainianCatholicUniversity/rukopys/train/metadata.jsonl", lines=True)
```
### With `polars`
```python
import polars as pl
df_train = pl.read_ndjson("hf://datasets/UkrainianCatholicUniversity/rukopys/train/metadata.jsonl")
```
### Direct download with `huggingface_hub`
```python
from huggingface_hub import snapshot_download
path = snapshot_download(repo_id="UkrainianCatholicUniversity/rukopys", repo_type="dataset")
# All files under `path` in the original folder structure (train/, silver/, test/)
```
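After a snapshot download, the metadata can be read without any extra dependencies. The sketch below assumes that `file_name` is relative to the split folder (as in `images/{uuid}.jpg`) and that each line of `metadata.jsonl` is one JSON object; `read_metadata` and the added `image_path` key are hypothetical names used only for illustration.

```python
import json
from pathlib import Path


def read_metadata(split_dir):
    """Yield one dict per line of <split_dir>/metadata.jsonl, with a resolved image path."""
    split_dir = Path(split_dir)
    with open(split_dir / "metadata.jsonl", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Assumption: file_name is relative to the split folder.
            record["image_path"] = str(split_dir / record["file_name"])
            yield record
```

For example, `list(read_metadata(Path(path) / "train"))` would collect the train records, where `path` is the folder returned by `snapshot_download`. Records from `test` carry `regions: null`, which arrives as `None`.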
---
## Annotation Schema
Each record in `train` and `silver` has a `regions` field — a list of annotated content regions:
```json
{
  "file_name": "images/abc123.jpg",
  "image_width": 3024,
  "image_height": 4032,
  "source": "dictation",
  "annotation_source": "annotator",
  "regions": [
    {
      "bbox": [134, 766, 3754, 1197],
      "type": "handwritten",
      "language": "uk",
      "legibility": "legible",
      "text": "Спочатку був брехунець. У нього кожного дня: „Клац!“"
    }
  ]
}
```
`bbox` format: `[x1, y1, x2, y2]` — pixel coordinates, top-left origin.
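A short sketch of two common box operations, assuming the `[x1, y1, x2, y2]` convention stated above. Clipping is included as a defensive step because the sample record in this card lists `x2 = 3754` against an `image_width` of 3024; whether real records exceed the image bounds is not stated. The function names are hypothetical.

```python
def clip_bbox(bbox, image_width, image_height):
    """Clamp an [x1, y1, x2, y2] box to the image bounds."""
    x1, y1, x2, y2 = bbox
    x1 = min(max(x1, 0), image_width)
    x2 = min(max(x2, 0), image_width)
    y1 = min(max(y1, 0), image_height)
    y2 = min(max(y2, 0), image_height)
    return [x1, y1, x2, y2]


def to_xywh(bbox):
    """Convert [x1, y1, x2, y2] to [x, y, width, height] (COCO-style)."""
    x1, y1, x2, y2 = bbox
    return [x1, y1, x2 - x1, y2 - y1]


box = clip_bbox([134, 766, 3754, 1197], image_width=3024, image_height=4032)
print(box)           # [134, 766, 3024, 1197]
print(to_xywh(box))  # [134, 766, 2890, 431]
```

The clipped box can be passed to Pillow's `Image.crop`, which takes a `(left, upper, right, lower)` tuple.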
### Region Types
| Type | Description | Transcription |
|------|-------------|---------------|
| `handwritten` | Handwritten text line | Exact text, 1 bbox = 1 line |
| `printed` | Printed/typed text line | Exact text, 1 bbox = 1 line |
| `formula` | Standalone math/chemistry expression | LaTeX |
| `table` | Full table | Pipe-separated values |
| `annotation` | Teacher marks, grades, numbering | Short text |
| `image` | Stamps, seals, drawings | Empty |
| `graph` | Charts, plots | Empty |
### Special Text Markers
| Marker | Meaning |
|--------|---------|
| `~~word~~` | Strikethrough text |
| `~~old~~{new}` | Strikethrough with correction |
| `[illegible]` | Unreadable word within a legible line |
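Some training setups want the final intended reading of a line rather than the raw markup. The sketch below is one possible normalization, assuming markers are not nested and that struck-through text contains no `~` character; `resolve_markers` is a hypothetical helper, and the sample sentence is invented.

```python
import re

# ~~old~~{new}: struck-through text followed by its correction in braces.
CORRECTION = re.compile(r"~~([^~]+)~~\{([^}]*)\}")
# ~~word~~: struck-through text with no correction.
STRIKE = re.compile(r"~~([^~]+)~~")


def resolve_markers(text, illegible_token="[illegible]"):
    """Return the line with corrections applied and struck-through text removed."""
    text = CORRECTION.sub(r"\2", text)  # keep the correction, drop the struck text
    text = STRIKE.sub("", text)         # drop remaining struck-through text
    text = text.replace("[illegible]", illegible_token)
    return re.sub(r"\s{2,}", " ", text).strip()  # collapse gaps left by removals


print(resolve_markers("Сьогодні ~~гарна~~{тепла} погода ~~дуже~~"))
# Сьогодні тепла погода
```

Passing `illegible_token=""` drops the `[illegible]` placeholder instead of keeping it. Whether to resolve or preserve the markers depends on the evaluation protocol, which this sketch does not assume.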
### Region Attributes
| Attribute | Values |
|-----------|--------|
| `language` | `uk`, `other` |
| `legibility` | `legible`, `illegible` |
| `annotation_source` | `annotator`, `volunteer`, `auto` (stored once per record, alongside `regions`) |
`annotation_source` values:
| Value | Meaning |
|-------|---------|
| `annotator` | Labeled by [Keymakr](https://keymakr.com/) — professional human annotation service |
| `volunteer` | Labeled by community volunteers; spot-checked for quality |
| `auto` | Auto-generated by the VLM pipeline (silver split only) |
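For line-level text recognition, the type and attribute fields above can be used to pick out usable lines. This is a minimal sketch, assuming regions are dicts with the `type`, `language`, and `legibility` keys shown in the schema; `select_text_lines` is a hypothetical helper, and keeping only `handwritten` and `printed` regions is an illustrative choice.

```python
TEXT_TYPES = {"handwritten", "printed"}


def select_text_lines(record, types=TEXT_TYPES, language=None):
    """Return legible text-line regions of a record, optionally restricted by language."""
    selected = []
    for region in record.get("regions") or []:  # test records carry regions: null
        if region["type"] not in types:
            continue
        if region["legibility"] != "legible":
            continue
        if language is not None and region["language"] != language:
            continue
        selected.append(region)
    return selected
```

Calling `select_text_lines(record, language="uk")` would keep only legible Ukrainian lines, which may be useful for the archive files that mix languages.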
---
## Anti-Leakage Design
| Source | Train | Test | Guarantee |
|--------|-------|------|-----------|
| **Dictation** | Year 2024 | Years 2020, 2022, 2025 | Different canonical texts |
| **Archive** | Archival file set A | Archival file set B | Non-overlapping archival document sets |
| **University** | Exam PDF group A | Exam PDF group B | Different students' exam files |
| **School** | Grades 5, 6, 7, 9, 11 | Grades 8, 10 | Different grade bands |
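The same principle, keeping whole groups on one side of a split, can be applied when carving a validation set out of `train`. The sketch below groups on `source` because that is the only grouping field in the schema shown in this card; it is an assumption-laden illustration, not the procedure used to build the official splits. If a finer key (year, grade, archival file) is available to you, substitute it for `key`.

```python
def group_holdout(records, held_out, key="source"):
    """Send every record whose `key` value is in `held_out` to validation, the rest to train."""
    held_out = set(held_out)
    train, val = [], []
    for record in records:
        if record[key] in held_out:
            val.append(record)
        else:
            train.append(record)
    return train, val


# Toy rows (invented); real rows would come from train/metadata.jsonl.
rows = [
    {"file_name": "images/a.jpg", "source": "dictation"},
    {"file_name": "images/b.jpg", "source": "archive"},
    {"file_name": "images/c.jpg", "source": "school"},
]
train_rows, val_rows = group_holdout(rows, held_out=["archive"])
print(len(train_rows), len(val_rows))  # 2 1
```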
---
## Silver Split
The `silver` split contains 8,210 auto-annotated images generated by a multi-stage VLM pipeline:
```
Stage 1: Qwen3-VL 8B block detection
Stage 2: Gemini Flash block classification
Stage 3: Qwen3-VL 8B line segmentation within text blocks
Stage 4: Gemini Flash transcription
```
Known limitations: bbox sequence drift on dense text; axis-aligned boxes may clip skewed lines; ~440 archive files contain mixed Ukrainian/Russian text from the 1919–1935 period.
---
## Acknowledgements
Professional annotation was provided by [Keymakr](https://keymakr.com/), a human-in-the-loop data annotation company.
Additional annotations were contributed by volunteers. The full list of contributors will be published shortly. All volunteer annotations underwent spot-checking for quality assurance.
All images were reviewed prior to publication to remove personally identifiable information (PII).
---
## Roadmap
This is the first public release of RUKOPYS. The dataset will grow incrementally — both through additional sources and through expanded coverage of existing ones.
We welcome collaboration from:
- **Annotators** interested in contributing human-verified labels
- **Researchers** working on better automatic annotation approaches (layout analysis, HTR pre-annotation, active learning)
If you'd like to contribute, reach out via the [Kaggle competition forum](https://www.kaggle.com/competitions/handwritten-to-data/discussion) or open an issue on HuggingFace.
---
## Potential Uses
- Fine-tune HTR models on `train`, evaluate on `test` via the [Kaggle competition](https://www.kaggle.com/competitions/handwritten-to-data)
- Pseudo-labeling: GT text for each dictation year is publicly known — use it for text-line alignment
- Self-training / semi-supervised learning with the `silver` split
- Multi-source domain adaptation (modern handwriting → historical documents)
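For the pseudo-labeling idea above, one simple way to align a noisy line transcription with a known canonical text is a sliding-window similarity search. This is a rough sketch using the standard library's `difflib.SequenceMatcher`; the function name, the `slack` parameter, and the sample strings are hypothetical, and a real pipeline would likely need a faster or more robust aligner.

```python
from difflib import SequenceMatcher


def best_matching_span(line_text, canonical_text, slack=2):
    """Find the word span of canonical_text most similar to a noisy line transcription."""
    words = canonical_text.split()
    n = max(1, len(line_text.split()))
    best_score, best_start, best_end = 0.0, 0, 0
    # Try windows a few words shorter or longer than the noisy line.
    for size in range(max(1, n - slack), n + slack + 1):
        for start in range(0, len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            score = SequenceMatcher(None, line_text, candidate).ratio()
            if score > best_score:
                best_score, best_start, best_end = score, start, start + size
    return " ".join(words[best_start:best_end]), best_score


canonical = "Спочатку був брехунець. У нього кожного дня щось нове"
span, score = best_matching_span("був брехунец. У ньго", canonical)
print(span)
```

The returned score can serve as a confidence threshold: low-similarity lines are better left unlabeled than forced onto the canonical text.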
---
## License
**CC BY-NC-SA 4.0** — Attribution, Non-Commercial, Share-Alike.
- **National Dictation** images: provided under a data sharing agreement for academic research and publication
- **State Archive** (ЦДАВО): provided under a data sharing agreement for academic research and publication
- **KNUTE** and **Opornyi Lyceum s. Zymne (Опорний ліцей с. Зимне)**: provided under data sharing agreements for academic research and publication
---
## Citation
```bibtex
@dataset{rukopys_2026,
title = {{RUKOPYS}: Ukrainian Handwritten Text Recognition Dataset},
author = {Dmytro Voitekh and Volodymyr Zmiivskyyi and Oleksii Molchanovskyi},
organization = {Ukrainian Catholic University},
year = {2026},
license = {CC BY-NC-SA 4.0},
url = {https://huggingface.co/datasets/UkrainianCatholicUniversity/rukopys},
note = {First large-scale Ukrainian HTR dataset; from 1920s archival documents to 2025 school homework and exams}
}
```
Provider:
swswswswsw



