
NNEngine/Captcha-OCR

Source: Hugging Face · Updated 2026-02-11 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/NNEngine/Captcha-OCR
Resource description:
---
license: mit
task_categories:
- text-classification
language:
- en
size_categories:
- 1M<n<10M
---

# Synthetic CAPTCHA OCR Dataset (1M)

## Overview

This dataset contains **synthetically generated CAPTCHA images** designed for training and benchmarking Optical Character Recognition (OCR) models.

Each image contains a randomly generated alphanumeric string rendered in CAPTCHA style with noise, distortions, and visual artifacts to simulate real-world conditions.

The dataset is created entirely using automated rendering pipelines and therefore contains perfectly accurate ground-truth labels.

---

## Dataset Characteristics

- **Dataset size:** 1,000,000 images
- **Image format:** PNG
- **Image resolution:** 160 × 60 pixels
- **Text length:** 5–10 characters
- **Character set:**
  - Uppercase letters (A–Z)
  - Lowercase letters (a–z)
  - Digits (0–9)

Each file is named using the ground-truth label:

```
<text>.png
```

Example:

```
A7kD3.png
pQ82Lm.png
```

Thus, labels can be directly extracted from filenames without requiring an additional annotation file.

## Generation Methodology

Images were generated using a synthetic rendering pipeline that includes:

- Random font selection
- Character position perturbations
- Random background noise
- Random line interference
- Gaussian pixel noise

This process improves robustness and helps OCR models generalize to real-world CAPTCHA images.

## Intended Use

This dataset is suitable for:

- Training deep learning OCR systems
- CAPTCHA recognition research
- Sequence recognition benchmarking
- Synthetic data pretraining for document OCR systems
- Curriculum learning before fine-tuning on real-world datasets

## Limitations

- Images are synthetically generated and may not capture every real-world CAPTCHA style.
- Domain adaptation may still be required for specific CAPTCHA systems.
- Distribution of character sequences is random rather than language-based.

## Citation

If you use this dataset in academic work, please cite:

```
Synthetic CAPTCHA OCR Dataset (1M), 2026
```
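
## Example: Reading Labels from Filenames

The filename convention described under Dataset Characteristics can be sketched in a few lines of standard-library Python. This is a minimal illustration, not part of the dataset's own tooling: the function names `label_from_filename` and `encode_label` are hypothetical, and the character-to-index ordering (uppercase, then lowercase, then digits) is an assumption chosen for the example rather than something the dataset card specifies.

```python
import string
from pathlib import Path

# Character set stated in the dataset card: A-Z, a-z, 0-9 (62 symbols).
# The ordering used for the index mapping below is an assumption.
CHARSET = string.ascii_uppercase + string.ascii_lowercase + string.digits
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(CHARSET)}

# Text length range stated in the dataset card.
MIN_LEN, MAX_LEN = 5, 10


def label_from_filename(path):
    """Return the ground-truth text encoded in a `<text>.png` filename.

    Hypothetical helper: raises ValueError if the name does not match
    the convention (PNG extension, 5-10 alphanumeric characters).
    """
    p = Path(path)
    if p.suffix.lower() != ".png":
        raise ValueError(f"not a PNG file: {path}")
    label = p.stem  # filename without directory and extension
    if not MIN_LEN <= len(label) <= MAX_LEN:
        raise ValueError(f"unexpected label length {len(label)}: {label!r}")
    if any(ch not in CHAR_TO_INDEX for ch in label):
        raise ValueError(f"label has characters outside the charset: {label!r}")
    return label


def encode_label(label):
    """Map each character to an integer class index (hypothetical encoding)."""
    return [CHAR_TO_INDEX[ch] for ch in label]


if __name__ == "__main__":
    # The two example filenames from the dataset card.
    for name in ["A7kD3.png", "pQ82Lm.png"]:
        text = label_from_filename(name)
        print(name, "->", text, encode_label(text))
```

Because the labels are case-sensitive, the label is taken from the filename as-is and only the extension check is case-insensitive. A sequence model trained on these labels would typically reserve one extra index (for example for a blank or padding symbol), which this sketch leaves out.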
Provided by:
NNEngine