nagohachi/japanese-str-dataset-v1

Name: nagohachi/japanese-str-dataset-v1
Creator: nagohachi
Published: 2026-01-05 10:41:39
License: cc-by-4.0

Source: Hugging Face
Updated: 2026-01-05
Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/nagohachi/japanese-str-dataset-v1

Description:

---
license: cc-by-4.0
task_categories:
- image-to-text
language:
- ja
tags:
- ocr
- webdataset
size_categories:
- 1M<n<10M
---

# STR Dataset

Japanese STR (Scene Text Recognition) dataset in WebDataset format.

This dataset is composed of:

- Images of Japanese named entities (full names and their affiliations)
- Images of sentences retrieved from Aozora Bunko (青空文庫) and their corresponding ground truth texts.

All images are synthesized using [TRDG](https://github.com/Belval/TextRecognitionDataGenerator).

## Dataset Structure

| Split | Samples | Shards |
|-------|---------|--------|
| train | 10,000,000 | 1000 |
| valid | 50,000 | 5 |
| test | 50,000 | 5 |
| **Total** | **10,100,000** | **1010** |

## Usage

```python
import webdataset as wds

base_url = "https://huggingface.co/datasets/nagohachi/japanese-str-dataset-v1/resolve/main"

# Load train split
train_dataset = (
    wds.WebDataset(base_url + "/train/train-{00000..00999}.tar")
    .decode("pil")
    .to_tuple("png", "txt")
)

for image, text in train_dataset:
    # image: PIL Image
    # text: str
    pass
```

## Format

Each sample contains:

- `png`: Image file (PNG format)
- `txt`: Ground truth text
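
To make the `png`/`txt` sample layout concrete, here is a minimal sketch that reads a WebDataset-style tar shard using only the Python standard library. It assumes the usual WebDataset convention that files sharing the same name up to the first dot of the basename form one sample and are stored adjacently in the tar. The helper names `iter_samples` and `build_demo_shard`, the member names such as `000000.png`, and the placeholder image bytes are all illustrative assumptions; the actual file names inside this dataset's shards are not stated in the card.

```python
import io
import posixpath
import tarfile


def iter_samples(fileobj):
    """Yield (key, fields) pairs from a WebDataset-style tar shard.

    `fields` maps each extension (e.g. "png", "txt") to raw bytes.
    Assumes members of one sample are adjacent in the tar.
    """
    current_key, fields = None, {}
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split "dir/000000.png" into key "dir/000000" and extension "png".
            dirname, base = posixpath.split(member.name)
            stem, _, ext = base.partition(".")
            key = posixpath.join(dirname, stem)
            if key != current_key and fields:
                yield current_key, fields
                fields = {}
            current_key = key
            fields[ext] = tar.extractfile(member).read()
    if fields:
        yield current_key, fields


def build_demo_shard():
    """Build a tiny in-memory shard (hypothetical names, placeholder image bytes)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, text in [("000000", "山田太郎"), ("000001", "吾輩は猫である")]:
            payloads = (
                ("png", b"\x89PNG placeholder, not a real image"),
                ("txt", text.encode("utf-8")),
            )
            for ext, payload in payloads:
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


# Decode the ground truth text of each demo sample as UTF-8.
samples = [
    (key, fields["txt"].decode("utf-8"))
    for key, fields in iter_samples(build_demo_shard())
]
for key, text in samples:
    print(key, text)
```

For a real shard downloaded to disk, the same function could be called as `iter_samples(open(path, "rb"))`, where `path` is a local file of your choosing. Decoding the `png` bytes into an image would additionally need an imaging library such as Pillow, which this sketch deliberately leaves out.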

Provider:

nagohachi
