Pritosh/odia-ocr-synth-v2
Source: Hugging Face. Updated 2026-03-27; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Pritosh/odia-ocr-synth-v2
Resource description:
---
language:
- or
license: mit
task_categories:
- image-to-text
tags:
- ocr
- odia
- oriya
- synthetic
- text-recognition
pretty_name: Odia OCR Synthetic Training Data v2
size_categories:
- 100K<n<1M
---
# Odia OCR Synthetic Training Data v2
This dataset contains 200,000 synthetic image-text pairs for training OCR models on **Odia (ଓଡ଼ିଆ)** script. It is a major upgrade over [v1](https://huggingface.co/datasets/Pritosh/odia-ocr-synth), adding 29 font variants, realistic backgrounds, and more diverse text types.
## What's New in v2
| Feature | v1 | v2 |
|---------|----|----|
| Samples | 100,000 | **200,000** |
| Font variants | 9 | **29** (9 families with weight variants) |
| Backgrounds | Solid only | **5 types**: solid, clean, paper texture, degraded, colored |
| DPI | Single | **5 levels**: 72, 96, 150, 200, 300 |
| Text modes | 5 | **9**: + numbers, rare conjuncts, punctuation-heavy, document pages |
| Size | 5.5 GB | **~21 GB** |
## Dataset Description
### Font Families (29 variants)
- Noto Sans Oriya (Thin, Light, Regular, Medium, Bold, Black)
- Noto Serif Oriya (Light, Regular, Bold)
- Anek Odia (Light, Regular, SemiBold, Bold, ExtraBold)
- Baloo Bhaina 2 (Regular, Medium, SemiBold, Bold, ExtraBold)
- Alkatra (Regular, Medium, SemiBold, Bold)
- Lohit Odia, Odia OT Jagannatha, Sakal Bharati, LipiPragatuchhi
- Nirmala UI (Regular, Bold)
### Text Sources
- Odia Wikipedia (12,825 articles, 15.6M chars)
- Purnachandra Dictionary (146,222 definitions)
- NIOS textbooks (27M+ Odia chars)
- Wikisource literature (1.28M chars)
- Engineering/science glossaries (47,523 terms)
### Sampling Modes
| Mode | % | Description |
|------|---|-------------|
| sentence | 25% | Full sentences from Wikipedia/textbooks |
| word | 15% | Single dictionary words |
| phrase | 15% | 2-5 consecutive words |
| paragraph | 15% | Multi-sentence blocks |
| mixed | 10% | Bilingual Odia-English pairs |
| punctuation_heavy | 5% | Quotes, parentheses, special chars |
| rare_conjuncts | 5% | ଷ୍ଟ, ଶ୍ୱ, କ୍ଷ, ଜ୍ଞ and similar |
| numbers | 5% | Dates, prices, phone numbers in Odia numerals |
| document_page | 5% | Full document-style layouts |
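The percentages above can be read as sampling weights. The card does not describe the generator itself, so the following is only a minimal sketch, assuming one mode is drawn independently per sample; `MODE_WEIGHTS` and `draw_modes` are illustrative names, not part of the dataset's tooling.

```python
import random

# Mode weights copied from the table above (percent of samples).
MODE_WEIGHTS = {
    "sentence": 25,
    "word": 15,
    "phrase": 15,
    "paragraph": 15,
    "mixed": 10,
    "punctuation_heavy": 5,
    "rare_conjuncts": 5,
    "numbers": 5,
    "document_page": 5,
}


def draw_modes(n, seed=None):
    """Draw n sampling modes independently, weighted by MODE_WEIGHTS."""
    rng = random.Random(seed)  # local RNG so a seed gives repeatable draws
    modes = list(MODE_WEIGHTS)
    weights = list(MODE_WEIGHTS.values())
    return rng.choices(modes, weights=weights, k=n)
```

The same weights can serve as a reference when checking that a subset drawn from `metadata.csv` keeps roughly the original mode mix.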
### Background Types
- **Solid** (50%): Clean solid color backgrounds
- **Clean** (20%): Near-solid with slight variation
- **Paper texture** (15%): Yellowed/aged paper simulation
- **Degraded** (10%): Photocopy/scan artifacts
- **Colored** (5%): Colored text on colored backgrounds
### Augmentations
Gaussian noise, blur, rotation, perspective skew, JPEG compression, ink bleed, and brightness/contrast variation are applied in random combinations.
## Dataset Structure
Data is stored as 40 tar.gz shards in `data/`, each containing 5,000 image-text pairs:
```
data/
shard_000000_004999.tar.gz
shard_005000_009999.tar.gz
...
shard_195000_199999.tar.gz
metadata.csv
```
Each shard extracts to:
```
images/NNNNNNN.png
ground_truth/NNNNNNN.gt.txt
```
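Because the shard names encode an index range, a single sample can be located without downloading everything. The sketch below assumes that sample indices are zero-based and that sample `i` lives in the shard whose name range covers `i`; the card implies this but does not state it. `shard_filename` is a hypothetical helper.

```python
SHARD_SIZE = 5_000      # pairs per shard, as stated above
NUM_SAMPLES = 200_000   # total pairs in the dataset


def shard_filename(sample_index):
    """Return the repo path of the shard assumed to hold sample_index."""
    if not 0 <= sample_index < NUM_SAMPLES:
        raise ValueError(f"sample_index out of range: {sample_index}")
    start = (sample_index // SHARD_SIZE) * SHARD_SIZE
    end = start + SHARD_SIZE - 1
    # Both bounds are zero-padded to six digits, matching the listing above.
    return f"data/shard_{start:06d}_{end:06d}.tar.gz"
```

The returned string can be passed as the `filename` argument of `hf_hub_download`.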
### metadata.csv columns
`image_path, gt_path, text, sample_mode, font_name, font_size, bg_color, bg_type, text_color, line_spacing, dpi, augmentations, image_width, image_height, num_chars, num_words`
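These columns make it possible to select a subset before loading any images. The following is a minimal sketch, assuming `metadata.csv` has a header row with exactly these column names and that `sample_mode` holds the mode names from the Sampling Modes table; neither detail is confirmed by the card. `select_rows` is an illustrative helper.

```python
import csv


def select_rows(csv_file, **criteria):
    """Yield metadata rows whose columns equal the given string values.

    csv_file is any iterable of CSV lines, such as an open text file.
    All comparisons are on strings, since csv does not convert types.
    """
    for row in csv.DictReader(csv_file):
        if all(row.get(col) == val for col, val in criteria.items()):
            yield row


# Example (path and values are assumptions):
# with open("metadata.csv", newline="", encoding="utf-8") as f:
#     words_300dpi = list(select_rows(f, sample_mode="word", dpi="300"))
```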
## Usage
### Download & Extract All Shards
```python
import os
import tarfile

from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "Pritosh/odia-ocr-synth-v2"
output_dir = "./odia_ocr_data_v2"
os.makedirs(output_dir, exist_ok=True)

# Collect every shard archive in the dataset repo.
files = [f for f in list_repo_files(repo_id, repo_type="dataset") if f.endswith(".tar.gz")]

for f in sorted(files):
    print(f"Downloading {f}...")
    path = hf_hub_download(repo_id=repo_id, filename=f, repo_type="dataset")
    # Each shard unpacks into images/ and ground_truth/ under output_dir.
    with tarfile.open(path, "r:gz") as tar:
        tar.extractall(output_dir)

print(f"Done! {len(os.listdir(os.path.join(output_dir, 'images')))} images ready.")
```
### Load a Single Shard
```python
import tarfile

from huggingface_hub import hf_hub_download

# Download one 5,000-pair shard and unpack it locally.
path = hf_hub_download(
    repo_id="Pritosh/odia-ocr-synth-v2",
    filename="data/shard_000000_004999.tar.gz",
    repo_type="dataset",
)
with tarfile.open(path, "r:gz") as tar:
    tar.extractall("./sample_data")
```
### PyTorch DataLoader
```python
import os

from PIL import Image
from torch.utils.data import Dataset


class OdiaOCRDataset(Dataset):
    """Image/text pairs read from the extracted images/ and ground_truth/ folders."""

    def __init__(self, root_dir, transform=None):
        self.root_dir = root_dir
        self.transform = transform
        # Sort so that indices are stable across runs.
        self.images = sorted(os.listdir(os.path.join(root_dir, "images")))

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_name = self.images[idx]
        img_path = os.path.join(self.root_dir, "images", img_name)
        # Ground truth shares the image's stem: NNNNNNN.png -> NNNNNNN.gt.txt
        gt_path = os.path.join(self.root_dir, "ground_truth",
                               img_name.replace(".png", ".gt.txt"))
        image = Image.open(img_path).convert("RGB")
        with open(gt_path, "r", encoding="utf-8") as f:
            text = f.read().strip()
        if self.transform:
            image = self.transform(image)
        return image, text


dataset = OdiaOCRDataset("./odia_ocr_data_v2")
```
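The class above is a `Dataset`; batching it requires wrapping it in a `DataLoader`. PyTorch's default collation cannot batch PIL images, and the per-sample `image_width` and `image_height` columns suggest that image sizes vary (an assumption), so stacking tensors of different shapes would also fail. A minimal sketch of a custom collate function that sidesteps both problems by returning plain lists is shown here; `collate_ocr_batch` is an illustrative name.

```python
def collate_ocr_batch(batch):
    """Group (image, text) pairs into two parallel lists without stacking."""
    images, texts = zip(*batch)
    return list(images), list(texts)


# Example wiring (requires torch and the OdiaOCRDataset instance above):
# from torch.utils.data import DataLoader
# loader = DataLoader(dataset, batch_size=32, shuffle=True,
#                     collate_fn=collate_ocr_batch)
# images, texts = next(iter(loader))
```

Resizing or padding images to a common shape inside the `transform` would allow the default collation to be used instead; which approach fits depends on the OCR model being trained.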
## License
MIT (code) | Font licenses: SIL OFL 1.1 / Apache 2.0
Provided by:
Pritosh



