darknight054/indic-mozhi-ocr
Source: Hugging Face. Updated 2026-01-28; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/darknight054/indic-mozhi-ocr
Description:
---
dataset_info:
- config_name: assamese
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 270059621
    num_examples: 79697
  - name: validation
    num_bytes: 35560878
    num_examples: 9945
  - name: test
    num_bytes: 35888127
    num_examples: 10146
  download_size: 294871418
  dataset_size: 341508626
- config_name: bengali
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 265996304
    num_examples: 80113
  - name: validation
    num_bytes: 28722853
    num_examples: 9787
  - name: test
    num_bytes: 30064081
    num_examples: 10113
  download_size: 296946389
  dataset_size: 324783238
- config_name: gujarati
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 218777410
    num_examples: 79910
  - name: validation
    num_bytes: 27306814
    num_examples: 10016
  - name: test
    num_bytes: 28092137
    num_examples: 10090
  download_size: 277921132
  dataset_size: 274176361
- config_name: hindi
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 199420825
    num_examples: 79762
  - name: validation
    num_bytes: 25265046
    num_examples: 10114
  - name: test
    num_bytes: 25412509
    num_examples: 10173
  download_size: 201143766
  dataset_size: 250098380
- config_name: kannada
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 402135202
    num_examples: 80085
  - name: validation
    num_bytes: 52843553
    num_examples: 10088
  - name: test
    num_bytes: 51026236
    num_examples: 9838
  download_size: 443475443
  dataset_size: 506004991
- config_name: malayalam
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 546769073
    num_examples: 80146
  - name: validation
    num_bytes: 66833736
    num_examples: 9893
  - name: test
    num_bytes: 69144765
    num_examples: 9980
  download_size: 647730561
  dataset_size: 682747574
- config_name: manipuri
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 339469096
    num_examples: 79691
  - name: validation
    num_bytes: 40930290
    num_examples: 10254
  - name: test
    num_bytes: 39848562
    num_examples: 10061
  download_size: 371291787
  dataset_size: 420247948
- config_name: marathi
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 271533031
    num_examples: 80151
  - name: validation
    num_bytes: 37502752
    num_examples: 10005
  - name: test
    num_bytes: 38640750
    num_examples: 9855
  download_size: 327539664
  dataset_size: 347676533
- config_name: oriya
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 324316449
    num_examples: 79945
  - name: validation
    num_bytes: 41542702
    num_examples: 10089
  - name: test
    num_bytes: 41599784
    num_examples: 9994
  download_size: 371710412
  dataset_size: 407458935
- config_name: punjabi
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 230904932
    num_examples: 79931
  - name: validation
    num_bytes: 29100311
    num_examples: 10036
  - name: test
    num_bytes: 28453274
    num_examples: 10038
  download_size: 233638413
  dataset_size: 288458517
- config_name: tamil
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 509878449
    num_examples: 80022
  - name: validation
    num_bytes: 60254676
    num_examples: 10021
  - name: test
    num_bytes: 58630158
    num_examples: 9974
  download_size: 575013641
  dataset_size: 628763283
- config_name: telugu
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 364605968
    num_examples: 80337
  - name: validation
    num_bytes: 46625909
    num_examples: 9811
  - name: test
    num_bytes: 45746874
    num_examples: 9876
  download_size: 419609545
  dataset_size: 456978751
- config_name: urdu
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 102375706
    num_examples: 9100
  - name: validation
    num_bytes: 12978377
    num_examples: 1138
  - name: test
    num_bytes: 13039498
    num_examples: 1137
  download_size: 128508755
  dataset_size: 128393581
configs:
- config_name: assamese
  data_files:
  - split: train
    path: assamese/train-*
  - split: validation
    path: assamese/validation-*
  - split: test
    path: assamese/test-*
- config_name: bengali
  data_files:
  - split: train
    path: bengali/train-*
  - split: validation
    path: bengali/validation-*
  - split: test
    path: bengali/test-*
- config_name: gujarati
  data_files:
  - split: train
    path: gujarati/train-*
  - split: validation
    path: gujarati/validation-*
  - split: test
    path: gujarati/test-*
- config_name: hindi
  data_files:
  - split: train
    path: hindi/train-*
  - split: validation
    path: hindi/validation-*
  - split: test
    path: hindi/test-*
- config_name: kannada
  data_files:
  - split: train
    path: kannada/train-*
  - split: validation
    path: kannada/validation-*
  - split: test
    path: kannada/test-*
- config_name: malayalam
  data_files:
  - split: train
    path: malayalam/train-*
  - split: validation
    path: malayalam/validation-*
  - split: test
    path: malayalam/test-*
- config_name: manipuri
  data_files:
  - split: train
    path: manipuri/train-*
  - split: validation
    path: manipuri/validation-*
  - split: test
    path: manipuri/test-*
- config_name: marathi
  data_files:
  - split: train
    path: marathi/train-*
  - split: validation
    path: marathi/validation-*
  - split: test
    path: marathi/test-*
- config_name: oriya
  data_files:
  - split: train
    path: oriya/train-*
  - split: validation
    path: oriya/validation-*
  - split: test
    path: oriya/test-*
- config_name: punjabi
  data_files:
  - split: train
    path: punjabi/train-*
  - split: validation
    path: punjabi/validation-*
  - split: test
    path: punjabi/test-*
- config_name: tamil
  data_files:
  - split: train
    path: tamil/train-*
  - split: validation
    path: tamil/validation-*
  - split: test
    path: tamil/test-*
- config_name: telugu
  data_files:
  - split: train
    path: telugu/train-*
  - split: validation
    path: telugu/validation-*
  - split: test
    path: telugu/test-*
- config_name: urdu
  data_files:
  - split: train
    path: urdu/train-*
  - split: validation
    path: urdu/validation-*
  - split: test
    path: urdu/test-*
language:
- as
- bn
- gu
- hi
- mr
- kn
- ml
- or
- pa
- ta
- te
- ur
tags:
- ocr
size_categories:
- 1M<n<10M
---
# Mozhi (Printed Word Images) - Indic OCR Dataset
This folder contains the **word-level printed OCR dataset** downloaded from the CVIT USODI project page for
**"Towards Deployable OCR Models for Indic Languages"**. The raw data is organized by language and then by split
(train/val/test) and is intended for upload to Hugging Face, where each language is a separate config with `train`, `validation`, and `test` splits.
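A minimal sketch of loading one language config with the Hugging Face `datasets` library. The repository ID is taken from this card's title and the config names from its metadata; the helper name `load_language` is hypothetical, and the call downloads data, so it needs network access and `pip install datasets`.

```python
# Repository ID as given in this card's title; config names from the card metadata.
REPO_ID = "darknight054/indic-mozhi-ocr"
CONFIGS = [
    "assamese", "bengali", "gujarati", "hindi", "kannada", "malayalam",
    "manipuri", "marathi", "oriya", "punjabi", "tamil", "telugu", "urdu",
]


def load_language(config, split="train"):
    """Load one split of one language config (hypothetical helper)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    # Imported lazily so the module can be inspected without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(REPO_ID, config, split=split)


# Example (needs network access):
#   ds = load_language("hindi", "validation")
#   sample = ds[0]
#   print(sample["id"], sample["text"], sample["image"].size)
```

Each record exposes the three features listed in the metadata: `id`, `image`, and `text`.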
## Source
Source page: https://cvit.iiit.ac.in/usodi/tdocrmil.php
**Paper:** Towards Deployable OCR Models for Indic Languages
**Authors:** Minesh Mathew, Ajoy Mondal, C V Jawahar
**Conference:** International Conference on Pattern Recognition (ICPR)
## Languages
The dataset provides word images for 13 languages:
- Assamese
- Bengali
- Gujarati
- Hindi
- Kannada
- Malayalam
- Manipuri
- Marathi
- Oriya (Odia)
- Punjabi
- Tamil
- Telugu
- Urdu
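As a quick sanity check on the metadata, the per-language split sizes can be tallied. The counts below are copied from the `num_examples` fields in this card's metadata; the variable names are illustrative.

```python
# (train, validation, test) example counts copied from the card metadata.
SPLIT_COUNTS = {
    "assamese": (79697, 9945, 10146),
    "bengali": (80113, 9787, 10113),
    "gujarati": (79910, 10016, 10090),
    "hindi": (79762, 10114, 10173),
    "kannada": (80085, 10088, 9838),
    "malayalam": (80146, 9893, 9980),
    "manipuri": (79691, 10254, 10061),
    "marathi": (80151, 10005, 9855),
    "oriya": (79945, 10089, 9994),
    "punjabi": (79931, 10036, 10038),
    "tamil": (80022, 10021, 9974),
    "telugu": (80337, 9811, 9876),
    "urdu": (9100, 1138, 1137),
}

# Total examples per language and across the whole dataset.
totals = {lang: sum(counts) for lang, counts in SPLIT_COUNTS.items()}
grand_total = sum(totals.values())

print(totals["urdu"])   # 11375
print(grand_total)      # 1211362
```

Twelve of the languages have roughly 100,000 word images each; Urdu is much smaller at 11,375. The grand total falls within the declared `1M<n<10M` size category.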
## Structure
```
raw_data_2/
├── assamese/
│   ├── train/
│   │   ├── images/
│   │   ├── train_gt.txt
│   │   └── vocabulary.txt
│   ├── val/
│   │   ├── images/
│   │   └── val_gt.txt
│   └── test/
│       ├── images/
│       └── test_gt.txt
├── bengali/
├── gujarati/
├── hindi/
├── kannada/
├── malayalam/
├── manipuri/
├── marathi/
├── oriya/
├── punjabi/
├── tamil/
├── telugu/
└── urdu/
Note: `vocabulary.txt` may be present in the train split for some languages.
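The layout above can be walked with `pathlib` to locate every ground-truth file. This is a sketch that assumes the raw data sits under a local root directory laid out as shown; the function name `find_gt_files` is hypothetical.

```python
from pathlib import Path

# Split directory names used in the raw download.
SPLITS = ("train", "val", "test")


def find_gt_files(root):
    """Map (language, split) to the ground-truth file path, for files that exist."""
    found = {}
    for lang_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for split in SPLITS:
            gt = lang_dir / split / f"{split}_gt.txt"
            if gt.is_file():
                found[(lang_dir.name, split)] = gt
    return found


# Example: find_gt_files("raw_data_2") returns keys such as ("assamese", "train").
```

Missing splits are skipped rather than raising, so a partial download can still be inspected.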
## Transcriptions (`*_gt.txt`)
Each line in `train_gt.txt`, `val_gt.txt`, and `test_gt.txt` is **tab-separated**:
```
images/[image_file].jpeg text
```
- `images/[image_file].jpeg` is the path to the word image relative to the split directory.
- `text` is the Unicode transcription for that image.
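A small parser for this format might look as follows. It is a sketch that assumes the image path contains no tab character, so it splits on the first tab only; the function names are hypothetical, and files should be opened with `encoding="utf-8"` when reading real data.

```python
def parse_gt_line(line):
    """Split one ground-truth line into (relative_image_path, text)."""
    path, sep, text = line.rstrip("\r\n").partition("\t")
    if not sep:
        raise ValueError(f"expected a tab-separated line, got: {line!r}")
    return path, text


def read_gt(lines):
    """Parse an iterable of lines, skipping blank ones."""
    return [parse_gt_line(line) for line in lines if line.strip()]


# The file name and word below are made-up examples, not taken from the dataset.
pairs = read_gt(["images/1.jpeg\tनमस्ते\n", "\n"])
print(pairs)  # [('images/1.jpeg', 'नमस्ते')]
```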
## Citation
If you use this dataset in your research, please cite:
```bibtex
@inproceedings{mathew2025towards,
title={Towards Deployable OCR Models for Indic Languages},
author={Mathew, Minesh and Mondal, Ajoy and Jawahar, CV},
booktitle={International Conference on Pattern Recognition},
pages={167--182},
year={2025},
organization={Springer}
}
```
## Acknowledgements
This work is supported by MeitY, Government of India, through the NLTM-Bhashini project.
## License
Please refer to the source page for licensing information.
Provider:
darknight054



