nagohachi/japanese-str-dataset-test
Source: Hugging Face. Updated 2026-01-04; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/nagohachi/japanese-str-dataset-test
Resource description:
---
license: cc-by-4.0
task_categories:
- image-to-text
language:
- ja
tags:
- ocr
- webdataset
size_categories:
- 1M<n<10M
---
# OCR Dataset
Japanese OCR dataset in WebDataset format.
## Dataset Structure
| Split | Samples | Shards |
|-------|---------|--------|
| train | 5,000,000 | 500 |
| valid | 50,000 | 5 |
| test | 50,000 | 5 |
| **Total** | **5,100,000** | **510** |
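As a quick sanity check on the table, the sketch below recomputes the totals and the average number of samples per shard. The figures are copied from the table; the per-shard value is only an average, since the card does not state that every shard holds exactly the same number of samples.

```python
# Split sizes copied from the Dataset Structure table: (samples, shards).
splits = {
    "train": (5_000_000, 500),
    "valid": (50_000, 5),
    "test": (50_000, 5),
}

# Average samples per shard; individual shards may differ (not stated in the card).
avg_per_shard = {
    name: samples // shards for name, (samples, shards) in splits.items()
}

# Totals across all splits, for comparison with the Total row.
total_samples = sum(samples for samples, _ in splits.values())
total_shards = sum(shards for _, shards in splits.values())

print(avg_per_shard)
print(total_samples, total_shards)
```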
## Usage
```python
import webdataset as wds
base_url = "https://huggingface.co/datasets/nagohachi/japanese-str-dataset-test/resolve/main"
# Load train split
train_dataset = (
    wds.WebDataset(base_url + "/train/train-{00000..00499}.tar")
    .decode("pil")
    .to_tuple("png", "txt")
)
for image, text in train_dataset:
    # image: PIL Image
    # text: str
    pass
```
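The card only shows the shard pattern for the train split. The sketch below builds the equivalent pattern for any split from the shard counts in the table. It assumes that `valid` and `test` follow the same `<split>/<split>-NNNNN.tar` layout as `train`, which the card does not confirm; `shard_pattern` and `SHARD_COUNTS` are hypothetical names introduced here for illustration.

```python
BASE_URL = "https://huggingface.co/datasets/nagohachi/japanese-str-dataset-test/resolve/main"

# Shard counts copied from the Dataset Structure table.
SHARD_COUNTS = {"train": 500, "valid": 5, "test": 5}


def shard_pattern(split: str) -> str:
    """Return a brace-expansion URL pattern covering all shards of one split.

    Assumption: every split uses the layout shown for train,
    i.e. <split>/<split>-NNNNN.tar with zero-based, 5-digit shard indices.
    """
    last = SHARD_COUNTS[split] - 1
    # Doubled braces emit literal { and } around the shard index range.
    return f"{BASE_URL}/{split}/{split}-{{00000..{last:05d}}}.tar"


print(shard_pattern("train"))
print(shard_pattern("valid"))
```

If the layout assumption holds, the returned string can be passed to `wds.WebDataset(...)` in place of the hard-coded train URL in the example above.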
## Format
Each sample contains:
- `png`: Image file (PNG format)
- `txt`: Ground truth text
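In the WebDataset convention, the files of one sample sit next to each other in the tar archive and share a basename, with the extension (`png`, `txt`) naming the field. The sketch below illustrates that grouping with the standard library only. The keys `000000` and `000001`, the label strings, and the placeholder image bytes are all made up for the demonstration and are not taken from the dataset.

```python
import io
import tarfile
from collections import defaultdict


def build_demo_shard() -> bytes:
    """Build a tiny in-memory tar with two samples (hypothetical keys and contents)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, text in [("000000", "こんにちは"), ("000001", "東京")]:
            members = {
                f"{key}.png": b"\x89PNG placeholder bytes, not a real image",
                f"{key}.txt": text.encode("utf-8"),
            }
            for name, payload in members.items():
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def group_samples(shard: bytes) -> dict:
    """Group tar members by basename so each key maps to {extension: bytes}."""
    samples = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split at the first dot: "000000.txt" -> key "000000", field "txt".
            key, _, ext = member.name.partition(".")
            samples[key][ext] = tar.extractfile(member).read()
    return dict(samples)


samples = group_samples(build_demo_shard())
for key, fields in samples.items():
    print(key, sorted(fields), fields["txt"].decode("utf-8"))
```

This is what `.to_tuple("png", "txt")` relies on in the usage example: each field is looked up by its extension within one grouped sample.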
Provider:
nagohachi



