Devanagari Handwritten CAPTCHA - Dataset of 90 K Images : A Challenge Test

Name: Devanagari Handwritten CAPTCHA - Dataset of 90 K Images : A Challenge Test
Creator: IEEE DataPort
Published: 2023-12-11 03:33:09
License: No description available

Source: DataCite Commons; updated 2023-12-11; indexed 2025-04-16

Download link:

https://ieee-dataport.org/documents/devanagari-handwritten-captcha-dataset-90-k-images-challenge-test

Description:

CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. The test is designed so that humans can pass it but current computer systems cannot, and it is used in many contexts to distinguish people from machines. The most common kind found on websites is the text-based CAPTCHA.

A text-based CAPTCHA is a sequence of letters or digits arranged in a particular order and rendered as an image. The image is then distorted with random lines, blocks, grids, rotations, and other kinds of noise. Because most characters in these CAPTCHAs are English, rural residents who only speak their local languages find the test hard to pass. Machine recognition of Devanagari characters is significantly more challenging than recognition of ordinary English-letter and numeral CAPTCHAs, because Devanagari characters are more complex. The vast majority of official Indian websites provide content exclusively in Devanagari, yet websites do not use Devanagari CAPTCHAs. For this reason, we have developed a new text-based CAPTCHA that uses Devanagari script.

A drawing canvas was created using Python. The canvas code was distributed to more than one hundred (100+) native Devanagari users of all ages, including both left- and right-handed computer users. Each user wrote 440 characters (44 characters, 10 times each) on the canvas and saved them on their own computer. All user data was then gathered and compiled. Each character is black on a white background; one benefit of using a canvas is that the images contain no noise. The final dataset contains a total of 44,000 digitized images: 10,000 numerals, 4,000 vowels, and 30,000 consonants.
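The counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the per-category character counts (10 numerals, 4 vowels, 30 consonants) are inferred from the stated image totals, and the writer count is taken as exactly 100 although the description says "more than one hundred".

```python
# Assumption: exactly 100 writers (the text says "more than one hundred").
WRITERS = 100
# Each writer draws every character 10 times, as stated in the description.
REPEATS_PER_CHAR = 10
# Assumption: characters per category, inferred from the image totals
# 10,000 / 4,000 / 30,000 at 1,000 images per character.
CHARS = {"numerals": 10, "vowels": 4, "consonants": 30}


def images_per_category(writers=WRITERS, repeats=REPEATS_PER_CHAR, chars=CHARS):
    """Return the expected number of images for each character category."""
    return {name: n * repeats * writers for name, n in chars.items()}


counts = images_per_category()
total = sum(counts.values())                            # whole character set
per_writer = sum(CHARS.values()) * REPEATS_PER_CHAR     # images from one writer

print(counts)
print("total images:", total)
print("images per writer:", per_writer)
```

Under these assumptions the totals agree with the description: 440 images per writer and 44,000 images overall.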
This character dataset was published for research scholars, for recognition and other applications, on Mendeley (Mendeley Data, DOI: 10.17632/yb9rmfjzc2.1, dated October 5, 2022) and on IEEE DataPort (DOI: 10.21227/9zpv-3194, dated October 6, 2022).

We designed our own algorithm to generate the handwritten Devanagari CAPTCHA, using the handwritten character set described above. Following general CAPTCHA generation principles, noise is added to each image using digital image processing techniques. Each CAPTCHA image is 250 x 90 pixels. Three types of character sets are used: handwritten alphabets, handwritten digits, and handwritten alphabets and digits combined. With 9 classes of 10,000 images each, a Devanagari CAPTCHA dataset of 90,000 images was created using Python. All images are stored in CSV format so that researchers can use them easily.

The noise is intended to make the CAPTCHA images harder to recognize and harder to break, and identifying Devanagari alphabets is already a difficult test to pass. The dataset is useful to researchers investigating CAPTCHA recognition in this area, and to those designing OCR systems to recognize and break Devanagari CAPTCHAs. If you are able to successfully bypass the CAPTCHA, please acknowledge us by sending an email to sanjayepate@gmail.com.
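The description does not give the authors' generation algorithm or the CSV layout, so the following is only a minimal sketch of the general idea: place character images on a 250 x 90 canvas, add line and speckle noise, and flatten the result into one CSV row. All function names, the noise parameters, the solid-square stand-in glyphs, and the "label followed by pixels" row layout are assumptions made for illustration.

```python
import random

WIDTH, HEIGHT = 250, 90   # CAPTCHA size stated in the description
WHITE, BLACK = 255, 0


def blank_canvas(width=WIDTH, height=HEIGHT):
    """Return a white canvas as a list of pixel rows."""
    return [[WHITE] * width for _ in range(height)]


def paste_glyph(canvas, glyph, left, top):
    """Copy the black pixels of a glyph (list of rows) onto the canvas."""
    for dy, row in enumerate(glyph):
        for dx, value in enumerate(row):
            y, x = top + dy, left + dx
            if value == BLACK and 0 <= y < len(canvas) and 0 <= x < len(canvas[0]):
                canvas[y][x] = BLACK


def draw_line(canvas, x0, y0, x1, y1):
    """Draw a straight black line by sampling points between the endpoints."""
    steps = max(abs(x1 - x0), abs(y1 - y0), 1)
    for i in range(steps + 1):
        x = round(x0 + (x1 - x0) * i / steps)
        y = round(y0 + (y1 - y0) * i / steps)
        canvas[y][x] = BLACK


def make_captcha(glyphs, n_lines=4, speckle=0.02, seed=None):
    """Compose glyphs left to right, then add hypothetical line and speckle noise."""
    rng = random.Random(seed)
    canvas = blank_canvas()
    slot = WIDTH // max(len(glyphs), 1)          # horizontal slot per character
    for i, glyph in enumerate(glyphs):
        gh, gw = len(glyph), len(glyph[0])
        left = i * slot + rng.randint(0, max(slot - gw, 0))
        top = rng.randint(0, max(HEIGHT - gh, 0))
        paste_glyph(canvas, glyph, left, top)
    for _ in range(n_lines):                     # random distractor lines
        draw_line(canvas, rng.randrange(WIDTH), rng.randrange(HEIGHT),
                  rng.randrange(WIDTH), rng.randrange(HEIGHT))
    for y in range(HEIGHT):                      # sparse black speckles
        for x in range(WIDTH):
            if rng.random() < speckle:
                canvas[y][x] = BLACK
    return canvas


def to_csv_row(canvas, label):
    """Flatten the image row by row into 'label,p0,p1,...' (assumed layout)."""
    return ",".join([label] + [str(p) for row in canvas for p in row])


# Stand-in glyphs: solid 30 x 30 squares instead of real handwritten characters.
placeholder = [[BLACK] * 30 for _ in range(30)]
captcha = make_captcha([placeholder] * 4, seed=0)
row = to_csv_row(captcha, "demo")
print(len(captcha), len(captcha[0]), len(row.split(",")))
```

In practice the placeholder squares would be replaced by images from the handwritten character set, and the actual column layout of the published CSV files should be checked before reading them.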

Provider: IEEE DataPort
Created: 2023-12-11
