
darknight054/indic-mozhi-ocr

Hugging Face · updated 2026-01-28 · indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/darknight054/indic-mozhi-ocr
Resource description:
---
dataset_info:
- config_name: assamese
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 270059621
    num_examples: 79697
  - name: validation
    num_bytes: 35560878
    num_examples: 9945
  - name: test
    num_bytes: 35888127
    num_examples: 10146
  download_size: 294871418
  dataset_size: 341508626
- config_name: bengali
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 265996304
    num_examples: 80113
  - name: validation
    num_bytes: 28722853
    num_examples: 9787
  - name: test
    num_bytes: 30064081
    num_examples: 10113
  download_size: 296946389
  dataset_size: 324783238
- config_name: gujarati
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 218777410
    num_examples: 79910
  - name: validation
    num_bytes: 27306814
    num_examples: 10016
  - name: test
    num_bytes: 28092137
    num_examples: 10090
  download_size: 277921132
  dataset_size: 274176361
- config_name: hindi
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 199420825
    num_examples: 79762
  - name: validation
    num_bytes: 25265046
    num_examples: 10114
  - name: test
    num_bytes: 25412509
    num_examples: 10173
  download_size: 201143766
  dataset_size: 250098380
- config_name: kannada
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 402135202
    num_examples: 80085
  - name: validation
    num_bytes: 52843553
    num_examples: 10088
  - name: test
    num_bytes: 51026236
    num_examples: 9838
  download_size: 443475443
  dataset_size: 506004991
- config_name: malayalam
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 546769073
    num_examples: 80146
  - name: validation
    num_bytes: 66833736
    num_examples: 9893
  - name: test
    num_bytes: 69144765
    num_examples: 9980
  download_size: 647730561
  dataset_size: 682747574
- config_name: manipuri
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 339469096
    num_examples: 79691
  - name: validation
    num_bytes: 40930290
    num_examples: 10254
  - name: test
    num_bytes: 39848562
    num_examples: 10061
  download_size: 371291787
  dataset_size: 420247948
- config_name: marathi
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 271533031
    num_examples: 80151
  - name: validation
    num_bytes: 37502752
    num_examples: 10005
  - name: test
    num_bytes: 38640750
    num_examples: 9855
  download_size: 327539664
  dataset_size: 347676533
- config_name: oriya
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 324316449
    num_examples: 79945
  - name: validation
    num_bytes: 41542702
    num_examples: 10089
  - name: test
    num_bytes: 41599784
    num_examples: 9994
  download_size: 371710412
  dataset_size: 407458935
- config_name: punjabi
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 230904932
    num_examples: 79931
  - name: validation
    num_bytes: 29100311
    num_examples: 10036
  - name: test
    num_bytes: 28453274
    num_examples: 10038
  download_size: 233638413
  dataset_size: 288458517
- config_name: tamil
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 509878449
    num_examples: 80022
  - name: validation
    num_bytes: 60254676
    num_examples: 10021
  - name: test
    num_bytes: 58630158
    num_examples: 9974
  download_size: 575013641
  dataset_size: 628763283
- config_name: telugu
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 364605968
    num_examples: 80337
  - name: validation
    num_bytes: 46625909
    num_examples: 9811
  - name: test
    num_bytes: 45746874
    num_examples: 9876
  download_size: 419609545
  dataset_size: 456978751
- config_name: urdu
  features:
  - name: id
    dtype: string
  - name: image
    dtype: image
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 102375706
    num_examples: 9100
  - name: validation
    num_bytes: 12978377
    num_examples: 1138
  - name: test
    num_bytes: 13039498
    num_examples: 1137
  download_size: 128508755
  dataset_size: 128393581
configs:
- config_name: assamese
  data_files:
  - split: train
    path: assamese/train-*
  - split: validation
    path: assamese/validation-*
  - split: test
    path: assamese/test-*
- config_name: bengali
  data_files:
  - split: train
    path: bengali/train-*
  - split: validation
    path: bengali/validation-*
  - split: test
    path: bengali/test-*
- config_name: gujarati
  data_files:
  - split: train
    path: gujarati/train-*
  - split: validation
    path: gujarati/validation-*
  - split: test
    path: gujarati/test-*
- config_name: hindi
  data_files:
  - split: train
    path: hindi/train-*
  - split: validation
    path: hindi/validation-*
  - split: test
    path: hindi/test-*
- config_name: kannada
  data_files:
  - split: train
    path: kannada/train-*
  - split: validation
    path: kannada/validation-*
  - split: test
    path: kannada/test-*
- config_name: malayalam
  data_files:
  - split: train
    path: malayalam/train-*
  - split: validation
    path: malayalam/validation-*
  - split: test
    path: malayalam/test-*
- config_name: manipuri
  data_files:
  - split: train
    path: manipuri/train-*
  - split: validation
    path: manipuri/validation-*
  - split: test
    path: manipuri/test-*
- config_name: marathi
  data_files:
  - split: train
    path: marathi/train-*
  - split: validation
    path: marathi/validation-*
  - split: test
    path: marathi/test-*
- config_name: oriya
  data_files:
  - split: train
    path: oriya/train-*
  - split: validation
    path: oriya/validation-*
  - split: test
    path: oriya/test-*
- config_name: punjabi
  data_files:
  - split: train
    path: punjabi/train-*
  - split: validation
    path: punjabi/validation-*
  - split: test
    path: punjabi/test-*
- config_name: tamil
  data_files:
  - split: train
    path: tamil/train-*
  - split: validation
    path: tamil/validation-*
  - split: test
    path: tamil/test-*
- config_name: telugu
  data_files:
  - split: train
    path: telugu/train-*
  - split: validation
    path: telugu/validation-*
  - split: test
    path: telugu/test-*
- config_name: urdu
  data_files:
  - split: train
    path: urdu/train-*
  - split: validation
    path: urdu/validation-*
  - split: test
    path: urdu/test-*
language:
- as
- bn
- gu
- hi
- mr
- kn
- ml
- or
- pa
- ta
- te
- ur
tags:
- ocr
size_categories:
- 1M<n<10M
---

# Mozhi (Printed Word Images) - Indic OCR Dataset

This folder contains the **word-level printed OCR dataset** downloaded from the CVIT USODI project page for **"Towards Deployable OCR Models for Indic Languages"**. The data is organized by language and split (train/val/test) and is intended for upload to Hugging Face.

## Source

Source page: https://cvit.iiit.ac.in/usodi/tdocrmil.php

**Paper:** Towards Deployable OCR Models for Indic Languages
**Authors:** Minesh Mathew, Ajoy Mondal, C V Jawahar
**Conference:** International Conference on Pattern Recognition (ICPR)

## Languages

The dataset provides word images for 13 languages:

- Assamese
- Bengali
- Gujarati
- Hindi
- Kannada
- Malayalam
- Manipuri
- Marathi
- Oriya (Odia)
- Punjabi
- Tamil
- Telugu
- Urdu

## Structure

```
raw_data_2/
├── assamese/
│   ├── train/
│   │   ├── images/
│   │   ├── train_gt.txt
│   │   └── vocabulary.txt
│   ├── val/
│   │   ├── images/
│   │   └── val_gt.txt
│   └── test/
│       ├── images/
│       └── test_gt.txt
├── bengali/
├── gujarati/
├── hindi/
├── kannada/
├── malayalam/
├── manipuri/
├── marathi/
├── oriya/
├── punjabi/
├── tamil/
├── telugu/
└── urdu/
```

Note: `vocabulary.txt` may be present in the train split for some languages.

## Transcriptions (`*_gt.txt`)

Each line in `train_gt.txt`, `val_gt.txt`, and `test_gt.txt` is **tab-separated**:

```
images/[image_file].jpeg	text
```

- `images/[image_file].jpeg` is the path to the word image relative to the split directory.
- `text` is the Unicode transcription for that image.
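
The tab-separated layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the original release: the helper names `parse_gt_line` and `read_gt_file` are hypothetical, and it assumes the files are UTF-8 encoded and that every non-blank line contains at least one tab.

```python
from pathlib import Path
from typing import Iterator, Tuple


def parse_gt_line(line: str) -> Tuple[str, str]:
    """Split one ground-truth line into (relative image path, transcription)."""
    # Strip only the line terminator so whitespace inside the text survives.
    # A line without a tab raises ValueError (assumed to be malformed).
    rel_path, text = line.rstrip("\r\n").split("\t", 1)
    return rel_path, text


def read_gt_file(gt_path: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (image path, transcription) pairs for one split directory."""
    split_dir = gt_path.parent  # image paths are relative to the split directory
    with gt_path.open(encoding="utf-8") as fh:  # UTF-8 is an assumption
        for line in fh:
            if not line.strip():
                continue  # skip blank lines
            rel_path, text = parse_gt_line(line)
            yield split_dir / rel_path, text
```

For example, `list(read_gt_file(Path("raw_data_2/hindi/val/val_gt.txt")))` would return the validation pairs for Hindi, with each image path resolved against `raw_data_2/hindi/val/`.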

## Citation

If you use this dataset in your research, please cite:

```bibtex
@inproceedings{mathew2025towards,
  title={Towards Deployable OCR Models for Indic Languages},
  author={Mathew, Minesh and Mondal, Ajoy and Jawahar, CV},
  booktitle={International Conference on Pattern Recognition},
  pages={167--182},
  year={2025},
  organization={Springer}
}
```

## Acknowledgements

This work is supported by MeitY, Government of India, through the NLTMBhashini project.

## License

Please refer to the source page for licensing information.
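
The dataset card metadata declares one config per language, each with `train`, `validation`, and `test` splits (the raw folders use `val/`, while the Hub split is named `validation`). A minimal sketch of loading one language/split pair with the Hugging Face `datasets` library might look like this. The helper `load_split` and the constants `CONFIG_NAMES` and `SPLITS` are hypothetical names introduced for illustration; the repository id is taken from this page, and the call needs network access.

```python
# Config names as declared in the dataset card metadata (one per language).
CONFIG_NAMES = [
    "assamese", "bengali", "gujarati", "hindi", "kannada", "malayalam",
    "manipuri", "marathi", "oriya", "punjabi", "tamil", "telugu", "urdu",
]
SPLITS = ("train", "validation", "test")


def load_split(config_name: str, split: str = "validation"):
    """Load one language/split pair from the Hub (requires network access)."""
    if config_name not in CONFIG_NAMES:
        raise ValueError(f"unknown config: {config_name!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    # Imported lazily so the constants above are usable without `datasets`.
    from datasets import load_dataset

    return load_dataset("darknight054/indic-mozhi-ocr", config_name, split=split)
```

A call such as `load_split("urdu", "test")` should return a dataset whose rows carry the `id`, `image`, and `text` features listed in the metadata. Urdu is the smallest config, so it is a reasonable first download.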
Provider: darknight054