nagohachi/japanese-str-dataset-v1
Source: Hugging Face. Updated 2026-01-05; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/nagohachi/japanese-str-dataset-v1
Resource description:
---
license: cc-by-4.0
task_categories:
- image-to-text
language:
- ja
tags:
- ocr
- webdataset
size_categories:
- 1M<n<10M
---
# STR Dataset
Japanese STR (Scene Text Recognition) dataset in WebDataset format.
Each sample pairs a synthesized image with its ground-truth text. The images render two kinds of text:
- Japanese named entities (full names and their affiliations)
- Sentences retrieved from Aozora Bunko (青空文庫)
All images are synthesized using [TRDG](https://github.com/Belval/TextRecognitionDataGenerator).
## Dataset Structure
| Split | Samples | Shards |
|-------|---------|--------|
| train | 10,000,000 | 1000 |
| valid | 50,000 | 5 |
| test | 50,000 | 5 |
| **Total** | **10,100,000** | **1010** |
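The table suggests a uniform shard size. The source does not state how samples are distributed, so the sketch below assumes an even spread across shards and simply checks the arithmetic:

```python
# Split sizes and shard counts copied from the table above.
splits = {
    "train": (10_000_000, 1000),
    "valid": (50_000, 5),
    "test": (50_000, 5),
}

# Assumption (not stated in the source): samples are spread evenly across shards.
samples_per_shard = {name: n // shards for name, (n, shards) in splits.items()}

total_samples = sum(n for n, _ in splits.values())
total_shards = sum(shards for _, shards in splits.values())

print(samples_per_shard)
print(total_samples, total_shards)
```

Under that assumption, every split works out to 10,000 samples per shard.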
## Usage
```python
import webdataset as wds

base_url = "https://huggingface.co/datasets/nagohachi/japanese-str-dataset-v1/resolve/main"

# Load the train split (1000 shards: train-00000.tar through train-00999.tar)
train_dataset = (
    wds.WebDataset(base_url + "/train/train-{00000..00999}.tar")
    .decode("pil")
    .to_tuple("png", "txt")
)

for image, text in train_dataset:
    # image: PIL Image
    # text: str
    pass
```
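The brace pattern `{00000..00999}` in the shard URL is shorthand for 1000 zero-padded shard names. As a minimal sketch, the same list can be built explicitly with the standard library. The helper `shard_urls` is a hypothetical name introduced here, and only the train naming pattern is confirmed by the usage example; applying it to other splits would be an assumption.

```python
base_url = "https://huggingface.co/datasets/nagohachi/japanese-str-dataset-v1/resolve/main"


def shard_urls(split: str, num_shards: int) -> list[str]:
    """Hypothetical helper: spell out '<split>/<split>-{00000..N}.tar' as a list."""
    # Shard indices are zero-padded to five digits, matching the train pattern.
    return [f"{base_url}/{split}/{split}-{i:05d}.tar" for i in range(num_shards)]


train_urls = shard_urls("train", 1000)
print(len(train_urls))
print(train_urls[0])
print(train_urls[-1])
```

An explicit list like this can be handy for taking a small subset of shards, for example `train_urls[:10]`, when smoke-testing a pipeline.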
## Format
Each sample contains:
- `png`: Image file (PNG format)
- `txt`: Ground truth text
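
To illustrate the layout, here is a sketch that groups tar members into samples using only the standard library. It assumes the usual WebDataset convention, where files sharing a basename form one sample and the extension is the field name. The member names (`000000.png` and so on), the placeholder image bytes, and the UTF-8 text encoding are assumptions for illustration; the source does not document them.

```python
import io
import tarfile


def iter_samples(fileobj):
    """Yield (key, {extension: bytes}) for each sample in a WebDataset-style tar."""
    current_key, sample = None, {}
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split "dir/000123.png" into key "dir/000123" and extension "png".
            dirname, _, basename = member.name.rpartition("/")
            stem, _, ext = basename.partition(".")
            key = f"{dirname}/{stem}" if dirname else stem
            # A new key means the previous sample is complete.
            if key != current_key and sample:
                yield current_key, sample
                sample = {}
            current_key = key
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield current_key, sample


def _add(tar, name, data):
    """Write one in-memory file into the tar archive."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


# Build a tiny in-memory shard with hypothetical member names and placeholder bytes.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    _add(tar, "000000.png", b"<png bytes>")
    _add(tar, "000000.txt", "青空文庫".encode("utf-8"))
    _add(tar, "000001.png", b"<png bytes>")
    _add(tar, "000001.txt", "山田太郎".encode("utf-8"))
buf.seek(0)

samples = list(iter_samples(buf))
texts = [fields["txt"].decode("utf-8") for _, fields in samples]
print(texts)
```

In practice the `webdataset` library handles this grouping and decoding; the sketch only shows what `.to_tuple("png", "txt")` is selecting from each sample.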
Provider:
nagohachi



