kalixlouiis/MyanmarOCR-ImageText

Name: kalixlouiis/MyanmarOCR-ImageText
Creator: kalixlouiis
Published: 2025-12-04 10:34:33
License: cc-by-4.0

Source: Hugging Face
Updated: 2025-12-04
Indexed: 2025-12-20

Download link:

https://hf-mirror.com/datasets/kalixlouiis/MyanmarOCR-ImageText

Description:

---
dataset_info:
  features:
  - name: image
    dtype: image
  - name: text
    dtype: string
  - name: style
    dtype: string
  splits:
  - name: train
    num_bytes: 532905094.304
    num_examples: 41664
  download_size: 266331659
  dataset_size: 532905094.304
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
language:
- my
size_categories:
- 10K<n<100K
license: cc-by-4.0
task_categories:
- feature-extraction
tags:
- ocr
- image-to-text
pretty_name: MyanmarOCR-ImageText
---

# 🇲🇲 MyanmarOCR-ImageText Dataset

A clean and diverse **Burmese Image-to-Text** dataset for OCR and multimodal AI research.

---

## 📌 Summary

- **Total images:** 41,664
- **Unique Burmese text entries:** 1,139
- **Styles per text:** 32 variations each
- **Resolution:** 512 × 512
- **File types:** PNG/JPG images
- **Dataset split:** train only
- **Use cases:** OCR, I2T (image-to-text), VLM pretrain/fine-tune

All text is **Burmese only**. It contains no English words and no punctuation such as: `? , ' " -`

---

## 🔡 Text Content

Includes:

- Common words & Pali words
- Signs and short phrases
- Full Myanmar Unicode support

> A variety of Myanmar spellings and writing forms is included
> (မြန်မာအက္ခရာတွေနဲ့ စကားလုံးမျိုးစုံပါဝင်ပါတယ်)

---

## 🎨 Style Variations

Each text appears in **32 visual styles** with differences in:

- font
- color
- texture
- rotation (small angle)
- background patterns

This helps models generalize across real-world environments.

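The summary figures can be checked directly against the rows. They do not multiply out as listed (1,139 × 32 = 36,448, whereas 41,664 = 1,302 × 32), so counting may be worthwhile before relying on them. The sketch below is one way to do that, assuming each row is a dict with `text` and `style` keys as in the feature list; the `summarize` helper and the demo rows are hypothetical and not part of the dataset.

```python
from collections import Counter


def summarize(records):
    """Count rows, distinct texts, distinct styles, and rows per text."""
    rows_per_text = Counter()
    styles = set()
    total = 0
    for rec in records:
        rows_per_text[rec["text"]] += 1
        styles.add(rec["style"])
        total += 1
    return {
        "rows": total,
        "unique_texts": len(rows_per_text),
        "unique_styles": len(styles),
        # Histogram: rows-per-text value -> number of texts with that value.
        "rows_per_text": dict(Counter(rows_per_text.values())),
    }


# Tiny in-memory stand-in for real rows (hypothetical labels).
demo = [
    {"text": "က", "style": "style_01"},
    {"text": "က", "style": "style_02"},
    {"text": "ခ", "style": "style_01"},
]
print(summarize(demo))
```

On the real data, passing something like `ds.select_columns(["text", "style"])` from the `datasets` library should avoid decoding every image while iterating. If the `rows_per_text` histogram has a single key of 32, every text has exactly 32 renderings.
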

---

## 🧩 Data Format

Each sample includes:

| Column | Type | Description |
|--------|------|-------------|
| `image` | image | 512×512 Burmese rendered text |
| `text` | string | Ground truth Burmese label |
| `style` | string | Style ID (e.g., `style_01`) |

Example record:

```json
{
  "image": "<image>",
  "text": "မြန်မာနိုင်ငံ",
  "style": "style_07"
}
```

---

## 🧪 Usage

```python
from datasets import load_dataset

ds = load_dataset("kalixlouiis/MyanmarOCR-ImageText", split="train")
print(ds[0])
ds[0]["image"].show()
```

---

## 🎯 Intended Purposes

- Burmese OCR training
- Scene text model finetuning
- Vision-language pretraining
- Synthetic-to-real text recognition research
- Benchmark for Myanmar multimodal AI

---

## ⚠️ Limitations

- Synthetic images only, not real photos/signboards
- No English text or punctuation
- No complex layout structures (single word/short text per image)

---

## 📄 License

This dataset is released under the **Creative Commons Attribution 4.0 International (CC-BY-4.0)** license.

You are free to:

- ✔ Share: copy and redistribute for any purpose
- ✔ Adapt: modify, transform, and build upon the data

As long as you:

- **Give appropriate credit**
- **Indicate changes**
- **Provide a link to the license**

📌 License Text: https://creativecommons.org/licenses/by/4.0/

---

## ✨ Acknowledgment

Created by **[@kalixlouiis](https://huggingface.co/kalixlouiis)** with the goal of improving **Myanmar OCR and AI research**.

If you use this dataset in your research or applications, please cite and provide a link to the dataset page on Hugging Face:

🔗 https://huggingface.co/datasets/kalixlouiis/MyanmarOCR-ImageText

---

## 📚 Citation

```bibtex
@dataset{kalixlouiis2025myanmarocr,
  title = {MyanmarOCR-ImageText},
  author = {Kalix Louis},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/kalixlouiis/MyanmarOCR-ImageText}},
  license = {CC-BY-4.0}
}
```

---
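
The dataset ships only a `train` split, and each text is rendered in many styles. A random row-level hold-out would therefore likely put the same label on both sides. The sketch below shows one possible way to hold out whole texts instead by hashing the label. The helper names `is_validation` and `split_indices` and the 10% default are assumptions for illustration, not something the dataset author prescribes.

```python
import hashlib


def is_validation(text: str, val_fraction: float = 0.1) -> bool:
    """Deterministically decide whether a label belongs to the validation side."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < val_fraction


def split_indices(texts, val_fraction: float = 0.1):
    """Return (train_idx, val_idx) so that all rows sharing a text stay together."""
    train_idx, val_idx = [], []
    for i, text in enumerate(texts):
        if is_validation(text, val_fraction):
            val_idx.append(i)
        else:
            train_idx.append(i)
    return train_idx, val_idx


# Hypothetical labels standing in for the `text` column.
labels = ["က", "ခ", "က", "ဂ", "ခ", "က"]
train_idx, val_idx = split_indices(labels, val_fraction=0.5)
print(train_idx, val_idx)
```

With the `datasets` library, the indices could then be applied as `ds.select(train_idx)` and `ds.select(val_idx)`, where `texts` is the `text` column. Because the assignment depends only on the label, the split is reproducible across runs, though the validation share is only approximately `val_fraction`.
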

Provider:

kalixlouiis
