MathCaptcha10k
Source: ModelScope community (updated 2025-08-29, indexed 2025-08-09)

Download link: https://modelscope.cn/datasets/atalaydenknalbant/MathCaptcha10k

## Dataset Details
* **Dataset Name:** MathCaptcha10k
* **Curated by:** Atalay Denknalbant
* **License:** Creative Commons Attribution 4.0 International (CC BY 4.0)
* **Repository:** [https://www.kaggle.com/datasets/atalaydenknalbant/mathcaptcha10k](https://www.kaggle.com/datasets/atalaydenknalbant/mathcaptcha10k)
### Dataset Description
A corpus of 10 000 synthetic arithmetic-captcha images rendered at 200×70 px. Each image contains exactly two base-10 numbers (1–2 digits each), a single `+` or `-` operator, an `=` sign, and a trailing question mark (e.g. `96-41=?`). Every example in the **train** split includes:

| image | ocr\_text | result |
| -------------------------- | --------- | ------ |
| PNG rendering of `96-41=?` | "96-41=?" | 55 |

Here `ocr_text` is the exact character sequence shown in the image, and `result` is the integer answer.
The **test** split consists of 11 766 unlabeled captchas in the `Unlabeled/` folder.
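
The relationship between `ocr_text` and `result` can be illustrated with a minimal sketch. It assumes labels use the ASCII `+` and `-` characters exactly as in the `96-41=?` example. The names `CAPTCHA_RE`, `solve`, and `is_consistent` are hypothetical helpers and are not part of the dataset.

```python
import re

# Assumed label pattern: two 1-2 digit operands, '+' or '-', then "=?".
CAPTCHA_RE = re.compile(r"(\d{1,2})([+-])(\d{1,2})=\?")


def solve(ocr_text: str) -> int:
    """Return the integer answer for a label such as '96-41=?'."""
    match = CAPTCHA_RE.fullmatch(ocr_text)
    if match is None:
        raise ValueError(f"unexpected captcha text: {ocr_text!r}")
    left, op, right = match.groups()
    if op == "+":
        return int(left) + int(right)
    return int(left) - int(right)


def is_consistent(ocr_text: str, result: int) -> bool:
    """Check that a stored result matches the arithmetic in ocr_text."""
    return solve(ocr_text) == result


print(solve("96-41=?"))  # 55
```

A check like `is_consistent` could be used to flag rows whose two labels disagree, though the card does not say whether the author ran such a check.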
---
## Examples of the Captchas
**Easy example**

**Challenging example**

> Even state-of-the-art vision-language models often misread the more distorted variants (see the “challenging” sample above).
---
## Uses
* **Direct uses**:
    * Train and evaluate OCR/vision-language models on simple arithmetic recognition.
    * Benchmark visual math-solving capabilities.
* **Out-of-scope uses**:
    * Handwritten digit OCR.
    * Complex mathematical notation beyond two-term arithmetic.
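
For the benchmarking use case, one plausible metric is exact-match accuracy, scored separately for the transcription (`ocr_text`) and the computed answer (`result`). The sketch below is an assumption about how one might evaluate, not a protocol defined by the dataset. The `accuracy` function and the sample predictions are hypothetical.

```python
from typing import Sequence


def accuracy(predictions: Sequence, references: Sequence) -> float:
    """Fraction of predictions that exactly equal their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Reference labels taken from the examples in this card.
gold_text = ["96-41=?", "75-26=?"]
gold_result = [55, 49]

# Hypothetical model output: the second captcha is misread.
pred_text = ["96-41=?", "75-28=?"]
pred_result = [55, 47]

print(accuracy(pred_text, gold_text))      # 0.5
print(accuracy(pred_result, gold_result))  # 0.5
```

Reporting both numbers separates recognition errors from arithmetic errors, since a model can transcribe the image correctly and still compute the wrong answer.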
---
## Dataset Structure
* **Splits**
    * `train` (10 000 labeled examples)
    * `test` (11 766 `.png` files in `Unlabeled/`)
* **Features**
    * `image` (PNG file)
    * `ocr_text` (string, e.g. `"75-26=?"`)
    * `result` (int, e.g. `49`)
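
As a sanity check on downloaded files, the stated 200×70 px size can be read from the PNG header using only the standard library. This is a minimal sketch: `png_size` is a hypothetical helper, and the demo builds a header in memory rather than assuming any particular file name inside the dataset.

```python
import struct
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> Tuple[int, int]:
    """Return (width, height) read from the IHDR chunk of PNG bytes."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type,
    # then big-endian 32-bit width and height.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


# Synthetic header for demonstration; a real file would be read with
# pathlib.Path(path).read_bytes().
header = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 200, 70)
print(png_size(header))  # (200, 70)
```

Only the first 24 bytes are inspected, so the check stays cheap even across all 11 766 files in `Unlabeled/`.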
---
## Dataset Creation
### Curation Rationale
Synthetic captchas provide a controlled environment for training and benchmarking. Even top-tier vision-language models struggle with some of the distortions, which motivated manual QA to ensure label accuracy.
### Source Data
Programmatically generated using [CaptchaMvc.Mvc5](https://www.nuget.org/packages/CaptchaMvc.Mvc5)’s standard arithmetic template.
### Data Collection & Processing
1. Generate 10 000 PNG captchas via CaptchaMvc.Mvc5.
2. Run a VLM-based OCR pipeline, then manually verify and correct every label in a Streamlit QA app.

**Annotator:**
* Atalay Denknalbant
---
## Personal & Sensitive Information
None. Captchas contain no personal data.
---
## Bias, Risks & Limitations
* Purely synthetic; may not generalize to natural or handwritten text.
* Limited to two-term, 1–2 digit arithmetic.
---
## Recommendations
Combine with broader OCR datasets for real-world text recognition tasks.
---
## Citation
```bibtex
@misc{atalay_denknalbant_2025,
title = {MathCaptcha10k},
author = {Atalay Denknalbant},
year = {2025},
howpublished = {\url{https://www.kaggle.com/ds/7779792}},
publisher = {Kaggle},
DOI = {10.34740/KAGGLE/DS/7779792}
}
```
**APA**
> Denknalbant, A. (2025). *MathCaptcha10k* \[Data set]. Kaggle. [https://doi.org/10.34740/KAGGLE/DS/7779792](https://doi.org/10.34740/KAGGLE/DS/7779792)
## Dataset Card Authors
* Atalay Denknalbant
## Dataset Card Contact
* Atalay Denknalbant (questions & feedback)

Provider: maas

Created: 2025-08-01