
nagohachi/japanese-str-dataset-v1

Source: Hugging Face. Updated: 2026-01-05. Indexed: 2026-03-29.
Download link:
https://hf-mirror.com/datasets/nagohachi/japanese-str-dataset-v1
Description:
---
license: cc-by-4.0
task_categories:
- image-to-text
language:
- ja
tags:
- ocr
- webdataset
size_categories:
- 1M<n<10M
---

# STR Dataset

Japanese STR (Scene Text Recognition) dataset in WebDataset format.

This dataset is composed of:

- Images of Japanese named entities (full names and their affiliations)
- Images of sentences retrieved from Aozora Bunko (青空文庫) and their corresponding ground truth texts.

All images are synthesized using [TRDG](https://github.com/Belval/TextRecognitionDataGenerator).

## Dataset Structure

| Split | Samples | Shards |
|-------|---------|--------|
| train | 10,000,000 | 1000 |
| valid | 50,000 | 5 |
| test | 50,000 | 5 |
| **Total** | **10,100,000** | **1010** |

## Usage

```python
import webdataset as wds

base_url = "https://huggingface.co/datasets/nagohachi/japanese-str-dataset-v1/resolve/main"

# Load train split
train_dataset = (
    wds.WebDataset(base_url + "/train/train-{00000..00999}.tar")
    .decode("pil")
    .to_tuple("png", "txt")
)

for image, text in train_dataset:
    # image: PIL Image
    # text: str
    pass
```

## Format

Each sample contains:

- `png`: Image file (PNG format)
- `txt`: Ground truth text
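
The `png`/`txt` pairing described under Format follows the general WebDataset convention: files inside a tar shard that share the same basename are grouped into one sample. The sketch below illustrates that grouping with the Python standard library only, on a tiny in-memory shard. It is not the dataset author's code; the helper names `make_demo_shard` and `iter_samples`, the keys `000000` and `000001`, and the placeholder image bytes are all hypothetical, and the actual file naming inside the published shards may differ.

```python
import io
import tarfile


def make_demo_shard(samples):
    """Build an in-memory tar shard from {key: (png_bytes, text)} (hypothetical helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, (png_bytes, text) in samples.items():
            for ext, payload in (("png", png_bytes), ("txt", text.encode("utf-8"))):
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def iter_samples(fileobj):
    """Yield (key, png_bytes, text) by grouping consecutive members that share a basename."""
    current_key, fields = None, {}
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Key = directory + file name up to the first dot; the rest is the extension.
            dirname, _, base = member.name.rpartition("/")
            stem, _, ext = base.partition(".")
            key = f"{dirname}/{stem}" if dirname else stem
            if key != current_key and fields:
                yield current_key, fields["png"], fields["txt"].decode("utf-8")
                fields = {}
            current_key = key
            fields[ext] = tar.extractfile(member).read()
    if fields:
        yield current_key, fields["png"], fields["txt"].decode("utf-8")


# Placeholder bytes stand in for real PNG data; they are NOT valid images.
demo_shard = make_demo_shard(
    {
        "000000": (b"\x89PNG-placeholder-0", "青空文庫"),
        "000001": (b"\x89PNG-placeholder-1", "山田 太郎"),
    }
)
samples = list(iter_samples(demo_shard))
for key, png_bytes, text in samples:
    print(key, len(png_bytes), text)
```

In practice the `webdataset` loader shown under Usage performs this grouping and the image decoding for you; the sketch is only meant to make the on-disk layout concrete.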
Provided by:
nagohachi