expa-ai/BHDD
Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/expa-ai/BHDD
Description:
---
license: cc-by-sa-4.0
task_categories:
- image-classification
language:
- my
tags:
- handwritten-digit-recognition
- mnist-format
- myanmar
- burmese
- ocr
- computer-vision
pretty_name: Burmese Handwritten Digit Dataset (BHDD)
size_categories:
- 10K<n<100K
source_datasets:
- original
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: image
    dtype: image
  - name: label
    dtype:
      class_label:
        names:
          "0": "၀"
          "1": "၁"
          "2": "၂"
          "3": "၃"
          "4": "၄"
          "5": "၅"
          "6": "၆"
          "7": "၇"
          "8": "၈"
          "9": "၉"
  splits:
  - name: train
    num_examples: 60000
  - name: test
    num_examples: 27561
---
# Burmese Handwritten Digit Dataset (BHDD)
BHDD is the first publicly available dataset for handwritten Burmese (Myanmar) digit recognition — the Burmese counterpart to MNIST.
It contains **87,561** grayscale images (28×28 px) of handwritten Burmese digits across 10 classes, collected from **over 150 contributors**.
| Split | Samples | Balanced? |
|-------|---------|-----------|
| Train | 60,000 | Yes (6,000 per class) |
| Test | 27,561 | No (natural frequency) |
The train/test split is **by contributor** — no writer's handwriting appears in both sets.
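A contributor-level split like the one described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the `writer_id` field and the `split_by_writer` helper are hypothetical, and the published dataset may not expose writer identifiers at all.

```python
import random

def split_by_writer(samples, test_fraction=0.3, seed=0):
    """Split samples so that no writer appears in both train and test.

    `samples` is a list of dicts carrying a hypothetical "writer_id" key.
    """
    writers = sorted({s["writer_id"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(writers)

    # Assign whole writers (not individual images) to the test set.
    n_test = max(1, round(len(writers) * test_fraction))
    test_writers = set(writers[:n_test])

    train = [s for s in samples if s["writer_id"] not in test_writers]
    test = [s for s in samples if s["writer_id"] in test_writers]
    return train, test

# Toy data: 10 hypothetical writers with 5 samples each.
samples = [{"writer_id": w, "label": i % 10} for w in range(10) for i in range(5)]
train, test = split_by_writer(samples)
```

Splitting at the writer level means test accuracy reflects generalization to unseen handwriting rather than to unseen images from already-seen writers.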
## Quick Start
```python
from datasets import load_dataset
ds = load_dataset("expa-ai/BHDD")
# Access a sample
sample = ds["train"][0]
sample["image"] # PIL Image (28x28 grayscale)
sample["label"] # int 0–9
```
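The integer labels correspond to the Burmese numerals listed in the card metadata. As a small sketch, the glyph can also be recovered arithmetically, because the Myanmar digits occupy the contiguous Unicode range U+1040 to U+1049. The helper name `label_to_glyph` is illustrative and is not part of the dataset or of the `datasets` API.

```python
MYANMAR_DIGIT_ZERO = 0x1040  # U+1040 MYANMAR DIGIT ZERO

def label_to_glyph(label: int) -> str:
    """Return the Burmese numeral for an integer class label in 0..9."""
    if not 0 <= label <= 9:
        raise ValueError(f"label must be in 0..9, got {label}")
    return chr(MYANMAR_DIGIT_ZERO + label)

# All ten class glyphs, in label order.
glyphs = [label_to_glyph(i) for i in range(10)]
print(" ".join(glyphs))
```

When the dataset is loaded with `datasets`, `ds["train"].features["label"].int2str(label)` should return the same class name from the metadata.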
**With PyTorch:**
```python
from datasets import load_dataset
from torchvision import transforms

ds = load_dataset("expa-ai/BHDD")

transform = transforms.Compose([
    transforms.ToTensor(),                 # uint8 [0, 255] -> float32 [0, 1], shape (1, 28, 28)
    transforms.Normalize((0.5,), (0.5,)),  # rescale to [-1, 1]
])

def preprocess(batch):
    # convert("L") guarantees a single grayscale channel
    batch["image"] = [transform(img.convert("L")) for img in batch["image"]]
    return batch

# Apply the transform lazily on access, so tensors are not written back to Arrow
ds = ds.with_transform(preprocess)
```
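To make the preprocessing above concrete, the per-pixel arithmetic can be written out in plain Python. `ToTensor()` scales uint8 values from [0, 255] to [0, 1], and `Normalize((0.5,), (0.5,))` then computes (x - 0.5) / 0.5, so the result lies in [-1, 1]. The function name `normalize_pixel` is used only for illustration.

```python
def normalize_pixel(value: int, mean: float = 0.5, std: float = 0.5) -> float:
    """Mirror ToTensor() followed by Normalize((mean,), (std,)) for one uint8 pixel."""
    scaled = value / 255.0        # ToTensor: [0, 255] -> [0, 1]
    return (scaled - mean) / std  # Normalize: [0, 1] -> [-1, 1] for mean = std = 0.5

for v in (0, 128, 255):
    print(v, round(normalize_pixel(v), 4))
```

The constants 0.5 and 0.5 are generic placeholders rather than statistics of this dataset; the per-dataset mean and standard deviation could be substituted if they are computed from the training split.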
## Dataset Details
- **Format:** 28×28 grayscale (uint8, 0–255), integer labels 0–9
- **Script:** Myanmar (Burmese), called *sar-lone* ("round script")
- **Collection:** The Expa.AI Research Team organized a community collection. Contributors wrote digits on A4 paper; the team photographed sheets with phone cameras and extracted digits using an Android app with adaptive thresholding and contour detection
- **Quality:** A 20-member annotation team reviewed the samples, then two data engineers did a final pass. No duplicates exist within or across splits
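The no-duplicates property can be spot-checked by hashing raw pixel bytes. The sketch below is one possible check, not the team's procedure. It detects only byte-identical images, and the helper name `find_exact_duplicates` is hypothetical.

```python
import hashlib
from collections import defaultdict

def find_exact_duplicates(images):
    """Group indices of byte-identical images.

    `images` is an iterable of bytes objects (for BHDD, 784 raw pixel bytes each).
    Returns a list of index groups that contain more than one member.
    """
    groups = defaultdict(list)
    for idx, raw in enumerate(images):
        digest = hashlib.sha256(raw).hexdigest()
        groups[digest].append(idx)
    return [idxs for idxs in groups.values() if len(idxs) > 1]

# Toy example with three 2x2 "images"; the first and third are identical.
toy = [bytes([0, 255, 255, 0]), bytes([255, 0, 0, 255]), bytes([0, 255, 255, 0])]
duplicates = find_exact_duplicates(toy)
print(duplicates)  # [[0, 2]]
```

For the real data, each PIL image can be turned into bytes with `img.tobytes()`, and the train and test images can be chained into one iterable to check across splits.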
## Baseline Results
| Model | Accuracy | Macro F1 | Parameters |
|-------|----------|----------|------------|
| MLP (2 hidden layers) | 99.40% | 0.993 | — |
| CNN (2 conv layers) | 99.75% | 0.996 | 421K |
| Improved CNN (3 conv + BN + augmentation) | **99.83%** | **0.998** | 431K |
Only 47 of 27,561 test samples are misclassified by the best model. Errors cluster around the 0–1 pair (closed vs. open circle).
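The error count and the reported accuracy are consistent, and because the test split is imbalanced, macro F1 is reported alongside accuracy. The following sketch reproduces the accuracy arithmetic and shows one way to compute macro F1 from predictions. The `macro_f1` helper is illustrative and is not the authors' evaluation code; scoring a class with no true or predicted samples as 0 is a choice made for this sketch.

```python
def macro_f1(y_true, y_pred, num_classes=10):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 here when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes

# Accuracy implied by 47 errors on 27,561 test samples.
errors, total = 47, 27_561
accuracy = 1 - errors / total
print(f"{accuracy:.2%}")  # 99.83%
```

Macro F1 weights every digit class equally, so a rare class with many errors lowers it more than it lowers overall accuracy.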
## Citation
```bibtex
@article{aung2025bhdd,
title = {{BHDD}: A Burmese Handwritten Digit Dataset},
author = {Swan Htet Aung and Hein Htet and Htoo Say Wah Khaing and Thuya Myo Nyunt},
year = {2025},
eprint = {2603.21966},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2603.21966}
}
```
## Links
- **Paper:** [arXiv:2603.21966](https://arxiv.org/abs/2603.21966)
- **GitHub:** [baseresearch/BHDD](https://github.com/baseresearch/BHDD)
- **License:** CC BY-SA 4.0
## Contributors
- **Swan Htet Aung** — Lead Researcher, Expa.AI
- **Hein Htet** — Research Engineer, Expa.AI
- **Htoo Say Wah Khaing** — Data Engineer, Expa.AI
- **Thuya Myo Nyunt** — Technical Lead, Expa.AI
Provided by:
expa-ai



