
Timka28/cyrillic_small

Source: Hugging Face. Updated 2025-12-15; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Timka28/cyrillic_small
Description:

The Cyrillic Handwriting Mixed Dataset (small) is a balanced subset of a combined collection of Cyrillic handwritten-text datasets, intended for handwriting recognition (HWR), optical character recognition (OCR), multimodal models, and quick experiments. It contains 37,831 samples: 32,299 in the train split and 5,532 in the test split. Each sample includes an image (datasets.Image / Pillow format), a text transcription, the source dataset, the final split (train/test), and the split in the original dataset, if one exists. The data is compiled from multiple public datasets, including cyrillic-handwriting-dataset, handwritten_ru_letters, and comnist, and follows specific train/test split rules. Transcription lengths range from 1 to 3,730 characters, with a mean of 87.1 and a median of 7 characters, so most samples are short while a small number of long texts raise the mean.
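The length statistics and split sizes quoted above can be reproduced with a few lines of Python. The following is a minimal sketch, not the dataset author's own tooling: the helper name `length_stats` and the toy transcriptions are made up for illustration, and only the two split counts come from the description.

```python
from statistics import mean, median


def length_stats(texts):
    """Return min, max, mean and median character length of the given strings."""
    lengths = [len(t) for t in texts]
    if not lengths:
        raise ValueError("texts must not be empty")
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


# Toy transcriptions (invented for illustration, NOT taken from the dataset):
# mostly single characters and short words, plus one long text.
sample_texts = ["а", "б", "да", "мир", "привет", "я" * 200]
stats = length_stats(sample_texts)
print(stats)  # the single long text pulls the mean far above the median

# Split sizes as stated in the description.
TRAIN, TEST = 32_299, 5_532
total = TRAIN + TEST
train_share = TRAIN / total
print(f"total={total}, train share={train_share:.1%}")
```

To apply this to the real data, one would presumably load it with the Hugging Face `datasets` library (`load_dataset("Timka28/cyrillic_small")`) and pass the transcription column of each split to `length_stats`. The exact column name is not given in this listing, so it should be checked against the dataset card first.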
Provider:
Timka28