Kannada_MNIST

Name: Kannada_MNIST
Creator: Kaggle
Published: 2019-08-04 00:00:00
License: No description available

www.kaggle.com, updated 2019-08-04, indexed 2025-01-16

Download link:

https://www.kaggle.com/higgstachyon/kannada-mnist

Description:

### Context

Here, we disseminate a new handwritten digits dataset, termed Kannada-MNIST, for the Kannada script, that can potentially serve as a direct drop-in replacement for the original MNIST dataset. In addition to this dataset, we disseminate an additional real-world handwritten dataset (with images), which we term the Dig-MNIST dataset, that can serve as an out-of-domain test dataset. We also open source all the code as well as the raw scanned images along with the scanner settings, so that researchers who want to try out different signal processing pipelines can perform end-to-end comparisons. We provide high-level morphological comparisons with the MNIST dataset and provide baseline accuracies for the datasets disseminated. The initial baselines obtained using an oft-used CNN architecture ( for the main test-set and for the Dig-MNIST test-set) indicate that these datasets do provide a sterner challenge with regard to generalizability than the MNIST or KMNIST datasets. We also hope this dissemination will spur the creation of similar datasets for all the languages that use different symbols for the numeral digits.

### Content

All details of the dataset curation have been captured in the paper:

Prabhu, Vinay Uday. "Kannada-MNIST: A new handwritten digits dataset for the Kannada language." arXiv preprint arXiv:1908.01242 (2019).

Link: https://arxiv.org/abs/1908.01242

### GitHub repository

https://github.com/vinayprabhu/Kannada_MNIST

### Open challenges to the machine learning community

We propose the following open challenges to the machine learning community at large.

1. Achieve MNIST-level accuracy by training on the Kannada-MNIST dataset and testing on the Dig-MNIST dataset without resorting to image pre-processing.
2. Characterize the nature of catastrophic forgetting when a CNN pre-trained on MNIST is retrained with Kannada-MNIST. This is particularly interesting given the observation that the typographical glyphs for 3 and 7 in Kannada-MNIST bear an uncanny resemblance to the glyph for 2 in MNIST.
3. Get a model trained on purely synthetic data, generated using the fonts (as in [22]) and augmented using frameworks such as [20] and [23], to achieve high accuracy on the Kannada-MNIST and Dig-MNIST datasets.
4. Replicate the procedure described in the paper across different languages/scripts, especially the Indic scripts.
5. With regard to the Dig-MNIST dataset, we saw that some of the volunteers had transgressed the borders of the grid. As a result, some of the images either contain only a partial slice of the glyph/stroke or could arguably belong to either of two different classes. For these images, it would be worthwhile to see if we can design a classifier that allocates proportionate softmax masses to the candidate classes.
6. The main reason behind us sharing the raw scan images was to foster research into auto-segmentation algorithms that will parse the individual digit images from the grid, which might in turn lead to higher-quality images in upgraded versions of the dataset.
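One possible reading of challenge 5 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's method: it assumes an ambiguous image is given a soft target that spreads label mass over its candidate classes, and it scores a classifier's logits with cross-entropy against that target. The helper names (`softmax`, `soft_target`, `soft_cross_entropy`), the 60/40 split, and the logit values are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target(candidates, num_classes=10):
    """Build a target distribution from {class_index: weight}.

    The weights are an assumption made for illustration; the dataset
    itself ships a single hard label per image.
    """
    total = sum(candidates.values())
    return [candidates.get(c, 0.0) / total for c in range(num_classes)]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


# Hypothetical ambiguous image: judged 60% class 3 and 40% class 7.
target = soft_target({3: 0.6, 7: 0.4})

# Logits whose softmax roughly matches the target split.
matched = [math.log(t) if t > 0 else -20.0 for t in target]
# Logits that put almost all of the mass on class 3 alone.
confident = [10.0 if c == 3 else 0.0 for c in range(10)]

loss_matched = soft_cross_entropy(matched, target)
loss_confident = soft_cross_entropy(confident, target)
print(f"matched: {loss_matched:.3f}  confident: {loss_confident:.3f}")
```

Under this loss, a prediction that splits its softmax mass in proportion to the candidate classes scores better than one that commits confidently to a single class, which is the behaviour the challenge asks for. How the candidate weights would be obtained for real Dig-MNIST images is left open here.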

Provider:

Kaggle
