NNEngine/Captcha-OCR

Name: NNEngine/Captcha-OCR
Creator: NNEngine
Published: 2026-02-11 20:33:46
License: not stated on this listing (the dataset card's metadata declares MIT)

Source: Hugging Face. Updated: 2026-02-11. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/NNEngine/Captcha-OCR

Resource description:

---
license: mit
task_categories:
- text-classification
language:
- en
size_categories:
- 1M<n<10M
---

# Synthetic CAPTCHA OCR Dataset (1M)

## Overview

This dataset contains **synthetically generated CAPTCHA images** designed for training and benchmarking Optical Character Recognition (OCR) models. Each image contains a randomly generated alphanumeric string rendered in CAPTCHA style with noise, distortions, and visual artifacts to simulate real-world conditions.

The dataset is created entirely using automated rendering pipelines and therefore contains perfectly accurate ground-truth labels.

---

## Dataset Characteristics

- **Dataset size:** 1,000,000 images
- **Image format:** PNG
- **Image resolution:** 160 × 60 pixels
- **Text length:** 5–10 characters
- **Character set:**
  - Uppercase letters (A–Z)
  - Lowercase letters (a–z)
  - Digits (0–9)

Each file is named using the ground-truth label:

```
<text>.png
```

Example:

```
A7kD3.png
pQ82Lm.png
```

Thus, labels can be directly extracted from filenames without requiring an additional annotation file.

## Generation Methodology

Images were generated using a synthetic rendering pipeline that includes:

- Random font selection
- Character position perturbations
- Random background noise
- Random line interference
- Gaussian pixel noise

This process improves robustness and helps OCR models generalize to real-world CAPTCHA images.

## Intended Use

This dataset is suitable for:

- Training deep learning OCR systems
- CAPTCHA recognition research
- Sequence recognition benchmarking
- Synthetic data pretraining for document OCR systems
- Curriculum learning before fine-tuning on real-world datasets

## Limitations

- Images are synthetically generated and may not capture every real-world CAPTCHA style.
- Domain adaptation may still be required for specific CAPTCHA systems.
- Distribution of character sequences is random rather than language-based.

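The filename-as-label convention described under Dataset Characteristics can be sketched as follows. This is a minimal illustration, not code shipped with the dataset: the helper names `label_from_filename` and `encode_label`, the `images/` directory in the usage note, and the ordering of the index vocabulary are all assumptions made for the example. Only the character set and the 5–10 character length range come from the card.

```python
from pathlib import Path
import string

# Character set stated in the dataset card: A-Z, a-z, 0-9 (62 symbols).
# The ordering used here is an arbitrary choice for this sketch.
CHARSET = string.ascii_uppercase + string.ascii_lowercase + string.digits
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(CHARSET)}


def label_from_filename(path):
    """Return the ground-truth text encoded in a `<text>.png` filename."""
    label = Path(path).stem  # drops the directory and the ".png" suffix
    if not 5 <= len(label) <= 10:
        raise ValueError(f"unexpected label length: {label!r}")
    if any(ch not in CHAR_TO_INDEX for ch in label):
        raise ValueError(f"unexpected character in label: {label!r}")
    return label


def encode_label(label):
    """Map each character to its integer index in CHARSET."""
    return [CHAR_TO_INDEX[ch] for ch in label]
```

For instance, `label_from_filename("images/A7kD3.png")` yields `"A7kD3"`, which `encode_label` turns into one integer per character for a sequence model. Because labels are case-sensitive, it may be worth checking that the files were extracted onto a case-sensitive filesystem, since names differing only in letter case can collide otherwise.
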
## Citation

If you use this dataset in academic work, please cite:

```
Synthetic CAPTCHA OCR Dataset (1M), 2026
```

Provider:

NNEngine
