expa-ai/BHDD

Name: expa-ai/BHDD
Creator: expa-ai
Published: 2026-03-24 05:25:57
License: CC BY-SA 4.0

Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/expa-ai/BHDD


Description:

---
license: cc-by-sa-4.0
task_categories:
- image-classification
language:
- my
tags:
- handwritten-digit-recognition
- mnist-format
- myanmar
- burmese
- ocr
- computer-vision
pretty_name: Burmese Handwritten Digit Dataset (BHDD)
size_categories:
- 10K<n<100K
source_datasets:
- original
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: image
    dtype: image
  - name: label
    dtype:
      class_label:
        names:
          "0": "၀"
          "1": "၁"
          "2": "၂"
          "3": "၃"
          "4": "၄"
          "5": "၅"
          "6": "၆"
          "7": "၇"
          "8": "၈"
          "9": "၉"
  splits:
  - name: train
    num_examples: 60000
  - name: test
    num_examples: 27561
---

# Burmese Handwritten Digit Dataset (BHDD)

BHDD is the first publicly available dataset for handwritten Burmese (Myanmar) digit recognition, the Burmese counterpart to MNIST.

It contains **87,561** grayscale images (28×28 px) of handwritten Burmese digits across 10 classes, collected from **over 150 contributors**.

| Split | Samples | Balanced? |
|-------|---------|-----------|
| Train | 60,000 | Yes (6,000 per class) |
| Test | 27,561 | No (natural frequency) |

The train/test split is **by contributor**: no writer's handwriting appears in both sets.
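The balance column in the table above can be checked by counting labels per split. The following is a minimal sketch, not code from the dataset authors: `label_counts` and `is_balanced` are hypothetical helper names, and the toy label lists are synthetic stand-ins so the example runs without downloading anything.

```python
from collections import Counter


def label_counts(labels, num_classes=10):
    """Return a list where position c holds the number of samples with label c."""
    counts = Counter(labels)
    return [counts.get(c, 0) for c in range(num_classes)]


def is_balanced(counts):
    """A split is balanced when every class has the same number of samples."""
    return len(set(counts)) == 1


# Synthetic stand-ins (assumption): a balanced split and an imbalanced one.
toy_train = [c for c in range(10) for _ in range(6)]  # 6 samples per class
toy_test = [0] * 5 + [1] * 3 + [2] * 1                # uneven class frequencies

train_counts = label_counts(toy_train)
test_counts = label_counts(toy_test)

print(train_counts, is_balanced(train_counts))
print(test_counts, is_balanced(test_counts))
```

Once the dataset has been loaded with `load_dataset("expa-ai/BHDD")`, passing `ds["train"]["label"]` to `label_counts` should give 6,000 for every class if the table holds, while the test split should show uneven counts.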

## Quick Start

```python
from datasets import load_dataset

ds = load_dataset("expa-ai/BHDD")

# Access a sample
sample = ds["train"][0]
sample["image"]  # PIL Image (28x28 grayscale)
sample["label"]  # int 0–9
```

**With PyTorch:**

```python
from datasets import load_dataset
from torchvision import transforms

ds = load_dataset("expa-ai/BHDD")

# Convert each image to a float tensor in [0, 1], then rescale to [-1, 1]
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])

def preprocess(example):
    # Force single-channel grayscale before converting to a tensor
    example["image"] = transform(example["image"].convert("L"))
    return example

ds = ds.map(preprocess)
ds.set_format(type="torch", columns=["image", "label"])
```

## Dataset Details

- **Format:** 28×28 grayscale (uint8, 0–255), integer labels 0–9
- **Script:** Myanmar (Burmese), called *sar-lone* ("round script")
- **Collection:** The Expa.AI Research Team organized a community collection. Contributors wrote digits on A4 paper; the team photographed the sheets with phone cameras and extracted the digits using an Android app with adaptive thresholding and contour detection.
- **Quality:** A 20-member annotation team reviewed the samples, then two data engineers did a final pass. No duplicates exist within or across splits.

## Baseline Results

| Model | Accuracy | Macro F1 | Parameters |
|-------|----------|----------|------------|
| MLP (2 hidden layers) | 99.40% | 0.993 | not reported |
| CNN (2 conv layers) | 99.75% | 0.996 | 421K |
| Improved CNN (3 conv + BN + augmentation) | **99.83%** | **0.998** | 431K |

The best model misclassifies only 47 of the 27,561 test samples. Errors cluster around the 0–1 pair (closed vs. open circle).
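The best model's accuracy follows directly from the error count quoted above, and macro F1 is the unweighted mean of the per-class F1 scores, which is why it is reported alongside accuracy for an imbalanced test split. The sketch below shows both calculations in plain Python. The function names are illustrative and not taken from the authors' code, and scoring a class with no true or predicted samples as 0.0 is an assumed convention.

```python
from collections import Counter


def accuracy_from_errors(num_errors, num_samples):
    """Accuracy as the fraction of samples that were not misclassified."""
    return 1.0 - num_errors / num_samples


def macro_f1(y_true, y_pred, num_classes=10):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in range(num_classes):
        denom = 2 * tp[c] + fp[c] + fn[c]
        # Assumed convention: a class with no true or predicted samples scores 0.0
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / num_classes


# Figures from the text: 47 misclassified out of 27,561 test samples.
best_acc = accuracy_from_errors(47, 27561)
print(f"{best_acc:.2%}")
```

Because each class contributes equally to macro F1 regardless of how many test samples it has, a model that does poorly on a rare digit is penalized more than plain accuracy would suggest.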

## Citation

```bibtex
@article{aung2025bhdd,
  title         = {{BHDD}: A Burmese Handwritten Digit Dataset},
  author        = {Swan Htet Aung and Hein Htet and Htoo Say Wah Khaing and Thuya Myo Nyunt},
  year          = {2025},
  eprint        = {2603.21966},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2603.21966}
}
```

## Links

- **Paper:** [arXiv:2603.21966](https://arxiv.org/abs/2603.21966)
- **GitHub:** [baseresearch/BHDD](https://github.com/baseresearch/BHDD)
- **License:** CC BY-SA 4.0

## Contributors

- **Swan Htet Aung**: Lead Researcher, Expa.AI
- **Hein Htet**: Research Engineer, Expa.AI
- **Htoo Say Wah Khaing**: Data Engineer, Expa.AI
- **Thuya Myo Nyunt**: Technical Lead, Expa.AI

Provider:

expa-ai
