Pritosh/odia-ocr-synth-v2

Name: Pritosh/odia-ocr-synth-v2
Creator: Pritosh
Published: 2026-03-27 08:46:22
License: No description

Hugging Face · Updated 2026-03-27 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/Pritosh/odia-ocr-synth-v2

Resource description:

---
language:
- or
license: mit
task_categories:
- image-to-text
tags:
- ocr
- odia
- oriya
- synthetic
- text-recognition
pretty_name: Odia OCR Synthetic Training Data v2
size_categories:
- 100K<n<1M
---

# Odia OCR Synthetic Training Data v2

200,000 synthetic image-text pairs for training OCR models on **Odia (ଓଡ଼ିଆ)** script. A major upgrade over [v1](https://huggingface.co/datasets/Pritosh/odia-ocr-synth) with 29 font variants, realistic backgrounds, and diverse text types.

## What's New in v2

| Feature | v1 | v2 |
|---------|----|----|
| Samples | 100,000 | **200,000** |
| Font variants | 9 | **29** (9 families with weight variants) |
| Backgrounds | Solid only | **5 types**: solid, clean, paper texture, degraded, colored |
| DPI | Single | **5 levels**: 72, 96, 150, 200, 300 |
| Text modes | 5 | **9**: adds numbers, rare conjuncts, punctuation-heavy, document pages |
| Size | 5.5 GB | **~21 GB** |

## Dataset Description

### Font Families (29 variants)

- Noto Sans Oriya (Thin, Light, Regular, Medium, Bold, Black)
- Noto Serif Oriya (Light, Regular, Bold)
- Anek Odia (Light, Regular, SemiBold, Bold, ExtraBold)
- Baloo Bhaina 2 (Regular, Medium, SemiBold, Bold, ExtraBold)
- Alkatra (Regular, Medium, SemiBold, Bold)
- Lohit Odia, Odia OT Jagannatha, Sakal Bharati, LipiPragatuchhi
- Nirmala UI (Regular, Bold)

### Text Sources

- Odia Wikipedia (12,825 articles, 15.6M chars)
- Purnachandra Dictionary (146,222 definitions)
- NIOS textbooks (27M+ Odia chars)
- Wikisource literature (1.28M chars)
- Engineering/science glossaries (47,523 terms)

### Sampling Modes

| Mode | % | Description |
|------|---|-------------|
| sentence | 25% | Full sentences from Wikipedia/textbooks |
| word | 15% | Single dictionary words |
| phrase | 15% | 2-5 consecutive words |
| paragraph | 15% | Multi-sentence blocks |
| mixed | 10% | Bilingual Odia-English pairs |
| punctuation_heavy | 5% | Quotes, parentheses, special chars |
| rare_conjuncts | 5% | ଷ୍ଟ, ଶ୍ୱ, କ୍ଷ, ଜ୍ଞ and similar |
| numbers | 5% | Dates, prices, phone numbers in Odia numerals |
| document_page | 5% | Full document-style layouts |

### Background Types

- **Solid** (50%): Clean solid color backgrounds
- **Clean** (20%): Near-solid with slight variation
- **Paper texture** (15%): Yellowed/aged paper simulation
- **Degraded** (10%): Photocopy/scan artifacts
- **Colored** (5%): Colored text on colored backgrounds

### Augmentations

Gaussian noise, blur, rotation, perspective skew, JPEG compression, ink bleed, and brightness/contrast variation, applied in random combinations.

## Dataset Structure

Data is stored as 40 tar.gz shards in `data/`, each containing 5,000 image-text pairs:

```
data/
  shard_000000_004999.tar.gz
  shard_005000_009999.tar.gz
  ...
  shard_195000_199999.tar.gz
  metadata.csv
```

Each shard extracts to:

```
images/NNNNNNN.png
ground_truth/NNNNNNN.gt.txt
```

### metadata.csv columns

`image_path, gt_path, text, sample_mode, font_name, font_size, bg_color, bg_type, text_color, line_spacing, dpi, augmentations, image_width, image_height, num_chars, num_words`

## Usage

### Download & Extract All Shards

```python
from huggingface_hub import hf_hub_download, list_repo_files
import tarfile, os

repo_id = "Pritosh/odia-ocr-synth-v2"
output_dir = "./odia_ocr_data_v2"
os.makedirs(output_dir, exist_ok=True)

# Collect every shard archive in the dataset repo.
files = [f for f in list_repo_files(repo_id, repo_type="dataset") if f.endswith(".tar.gz")]

# Download each shard (cached by huggingface_hub) and unpack it into output_dir.
for f in sorted(files):
    print(f"Downloading {f}...")
    path = hf_hub_download(repo_id=repo_id, filename=f, repo_type="dataset")
    with tarfile.open(path, "r:gz") as tar:
        tar.extractall(output_dir)

print(f"Done! {len(os.listdir(os.path.join(output_dir, 'images')))} images ready.")
```

### Load a Single Shard

```python
from huggingface_hub import hf_hub_download
import tarfile

# Fetch only the first shard (samples 0 to 4,999).
path = hf_hub_download(
    repo_id="Pritosh/odia-ocr-synth-v2",
    filename="data/shard_000000_004999.tar.gz",
    repo_type="dataset",
)

with tarfile.open(path, "r:gz") as tar:
    tar.extractall("./sample_data")
```

### PyTorch DataLoader

```python
from torch.utils.data import Dataset
from PIL import Image
import os


class OdiaOCRDataset(Dataset):
    def __init__(self, root_dir, transform=None):
        self.root_dir = root_dir
        self.transform = transform
        # Sorted listing keeps sample order stable across runs.
        self.images = sorted(os.listdir(os.path.join(root_dir, "images")))

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_name = self.images[idx]
        img_path = os.path.join(self.root_dir, "images", img_name)
        # Ground truth shares the image's base name: NNNNNNN.png -> NNNNNNN.gt.txt
        gt_path = os.path.join(self.root_dir, "ground_truth", img_name.replace(".png", ".gt.txt"))

        image = Image.open(img_path).convert("RGB")
        with open(gt_path, "r", encoding="utf-8") as f:
            text = f.read().strip()

        if self.transform:
            image = self.transform(image)
        return image, text


dataset = OdiaOCRDataset("./odia_ocr_data_v2")
```

## License

MIT (code) | Font licenses: SIL OFL 1.1 / Apache 2.0
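
## Example: Filtering Samples with metadata.csv

The `metadata.csv` columns listed above make it possible to pick a subset of samples (for example, one sampling mode at one DPI) before loading any images. The following is a minimal sketch using only the Python standard library. The helper names `read_metadata`, `select_rows`, and `count_by_column` are hypothetical, the rows in `SAMPLE_CSV` are made up for illustration and cover only a few of the columns, and the sketch assumes that the `sample_mode` values in the real file match the mode names in the Sampling Modes table.

```python
import csv
import io
from collections import Counter


def read_metadata(fileobj):
    """Parse metadata CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(fileobj))


def select_rows(rows, **criteria):
    """Keep rows whose columns equal every given value (compared as strings)."""
    return [
        row for row in rows
        if all(row.get(col) == str(val) for col, val in criteria.items())
    ]


def count_by_column(rows, column):
    """Count how many rows fall under each distinct value of `column`."""
    return Counter(row[column] for row in rows)


# Made-up rows standing in for the real metadata.csv (illustration only).
SAMPLE_CSV = """image_path,gt_path,text,sample_mode,dpi
images/0000000.png,ground_truth/0000000.gt.txt,a,word,300
images/0000001.png,ground_truth/0000001.gt.txt,b c,phrase,72
images/0000002.png,ground_truth/0000002.gt.txt,d,word,72
"""

rows = read_metadata(io.StringIO(SAMPLE_CSV))
words_72dpi = select_rows(rows, sample_mode="word", dpi=72)

print(count_by_column(rows, "sample_mode"))
print([r["image_path"] for r in words_72dpi])
```

To run this against the real file, pass a file object opened with `newline=""` and `encoding="utf-8"` to `read_metadata`; the `newline=""` argument matters because the `text` column may contain line breaks for multi-line samples. The selected `image_path` values can then be used to restrict which files a dataset class loads.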

Provider:

Pritosh
