Pritosh/odia-ocr-synth

Name: Pritosh/odia-ocr-synth
Creator: Pritosh
Published: 2026-03-26 14:35:23
License: No description available

Source: Hugging Face. Updated 2026-03-26; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/Pritosh/odia-ocr-synth

Resource description:

---
language:
- or
license: mit
task_categories:
- image-to-text
tags:
- ocr
- odia
- oriya
- synthetic
- text-recognition
pretty_name: Odia OCR Synthetic Training Data
size_categories:
- 10K<n<100K
---

# Odia OCR Synthetic Training Data

100,000 synthetic image-text pairs for training OCR models on **Odia (ଓଡ଼ିଆ)** script.

## Dataset Description

Each sample consists of:
- A PNG image containing rendered Odia text
- A ground truth `.gt.txt` file with the corresponding Unicode text

### Generation Details

- **Font families**: 9 (Noto Sans/Serif Oriya, Anek Odia, Baloo Bhaina 2, Lohit Odia, Odia OT Jagannatha, Sakal Bharati, Nirmala UI)
- **Font variants**: 24 (different weights: Thin to ExtraBold)
- **Text sources**: Odia Wikipedia (91M chars), Purnachandra Dictionary (146K definitions), NIOS textbooks, Wikisource literature, engineering/science glossaries
- **Augmentations**: Gaussian noise, blur, rotation, perspective skew, JPEG compression, ink bleed, brightness/contrast variation
- **Difficulty**: Medium (balanced augmentation for realistic training)

### Sampling Modes

| Mode | Proportion | Description |
|------|-----------|-------------|
| word | 20% | Single dictionary words |
| phrase | 20% | 2-5 consecutive words |
| sentence | 30% | Full sentences |
| paragraph | 15% | Multi-sentence blocks |
| mixed | 15% | Bilingual Odia-English pairs |

## Dataset Structure

```
images/          # 100,000 PNG files (0000000.png to 0099999.png)
ground_truth/    # 100,000 text files (0000000.gt.txt to 0099999.gt.txt)
metadata.csv     # Full generation metadata
```

### metadata.csv columns

`image_path, gt_path, text, sample_mode, font_name, font_size, bg_color, text_color, line_spacing, augmentations, image_width, image_height, num_chars, num_words`

## Dataset Structure (Sharded)

The data is stored as 20 tar.gz shards in `data/`, each containing 5,000 image-text pairs:

```
data/
  shard_000000_004999.tar.gz   # samples 0-4999
  shard_005000_009999.tar.gz   # samples 5000-9999
  ...
  shard_095000_099999.tar.gz   # samples 95000-99999
metadata.csv                   # Full generation metadata
```

Each shard extracts to:

```
images/0000000.png ... 0004999.png
ground_truth/0000000.gt.txt ... 0004999.gt.txt
```

## Usage

Compatible with Tesseract, Kraken, EasyOCR, PaddleOCR, and deep learning frameworks.

### Download & Extract All Shards

```python
from huggingface_hub import hf_hub_download, list_repo_files
import tarfile, os

repo_id = "Pritosh/odia-ocr-synth"
output_dir = "./odia_ocr_data"
os.makedirs(output_dir, exist_ok=True)

# Find all shard files
files = [f for f in list_repo_files(repo_id, repo_type="dataset") if f.endswith(".tar.gz")]

for f in sorted(files):
    print(f"Downloading {f}...")
    path = hf_hub_download(repo_id=repo_id, filename=f, repo_type="dataset")
    with tarfile.open(path, "r:gz") as tar:
        tar.extractall(output_dir)
    print(f"  Extracted to {output_dir}")

print(f"Done! {len(os.listdir(os.path.join(output_dir, 'images')))} images ready.")
```

### Load a Single Shard

```python
from huggingface_hub import hf_hub_download
import tarfile

path = hf_hub_download(
    repo_id="Pritosh/odia-ocr-synth",
    filename="data/shard_000000_004999.tar.gz",
    repo_type="dataset",
)
with tarfile.open(path, "r:gz") as tar:
    tar.extractall("./sample_data")
```

### Use with PyTorch

```python
from torch.utils.data import Dataset
from PIL import Image
import os

class OdiaOCRDataset(Dataset):
    def __init__(self, root_dir, transform=None):
        self.root_dir = root_dir
        self.transform = transform
        # Keep only PNG files so stray files in images/ cannot break the pairing
        self.images = sorted(
            f for f in os.listdir(os.path.join(root_dir, "images")) if f.endswith(".png")
        )

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_name = self.images[idx]
        img_path = os.path.join(self.root_dir, "images", img_name)
        # 0000000.png pairs with 0000000.gt.txt
        gt_path = os.path.join(self.root_dir, "ground_truth", img_name.replace(".png", ".gt.txt"))

        image = Image.open(img_path).convert("RGB")
        with open(gt_path, "r", encoding="utf-8") as f:
            text = f.read().strip()

        if self.transform:
            image = self.transform(image)
        return image, text

dataset = OdiaOCRDataset("./odia_ocr_data")
```

### Tesseract Training

```bash
# Extract all shards first, then run recognition with the `ori` model:
tesseract image.png output --oem 1 --psm 6 -l ori
```

## Generation Tool

Generated using [odia-ocr-synth](https://github.com/pritoshkumar/odia-ocr-synth), an open-source tool for generating synthetic Odia OCR training data.

## License

MIT (code) | Font licenses: SIL OFL 1.1 / Apache 2.0
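
## Locating a Sample's Shard

The sharded layout described in the card implies a simple mapping from a sample index to the archive that holds it, which helps when only a few samples are needed and downloading all 20 shards is unnecessary. The sketch below assumes the naming pattern visible in the listed shards (`shard_<start>_<end>.tar.gz` with 6-digit bounds, 5,000 pairs per shard, 7-digit file stems) holds for every shard. The helper names `shard_for` and `member_paths` are hypothetical and are not part of the dataset or its generation tool.

```python
SHARD_SIZE = 5_000      # pairs per shard, as stated in the dataset card
NUM_SAMPLES = 100_000   # total pairs, as stated in the dataset card


def shard_for(index: int) -> str:
    """Return the repo path of the shard expected to contain sample `index`."""
    if not 0 <= index < NUM_SAMPLES:
        raise IndexError(f"sample index out of range: {index}")
    start = (index // SHARD_SIZE) * SHARD_SIZE
    end = start + SHARD_SIZE - 1
    # Assumed pattern, inferred from the shard names listed in the card
    return f"data/shard_{start:06d}_{end:06d}.tar.gz"


def member_paths(index: int) -> tuple[str, str]:
    """Return the (image, ground truth) paths inside the extracted shard."""
    stem = f"{index:07d}"  # 7-digit zero padding, e.g. 0004999
    return f"images/{stem}.png", f"ground_truth/{stem}.gt.txt"


print(shard_for(97_321))
print(member_paths(97_321))
```

The string returned by `shard_for` can be passed as `filename` to `hf_hub_download`, as in the single-shard example, and the paths from `member_paths` are relative to the extraction directory.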

Provider:

Pritosh
