mnist

Name: mnist
Creator: maas
Published: 2026-05-11 10:19:37
License: No description provided

ModelScope community. Updated 2026-05-11; indexed 2024-09-28.

Download link:

https://modelscope.cn/datasets/cutedataset/mnist

Resource description:

# Dataset Card for MNIST

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** http://yann.lecun.com/exdb/mnist/
- **Repository:**
- **Paper:** MNIST handwritten digit database by Yann LeCun, Corinna Cortes, and CJ Burges
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

The MNIST dataset consists of 70,000 28x28 grayscale images of handwritten digits extracted from two NIST databases. There are 60,000 images in the training set and 10,000 images in the test set, with one class per digit for a total of 10 classes and roughly 7,000 images per class (about 6,000 training images and 1,000 test images; the classes are not exactly balanced). Half of the images were drawn by Census Bureau employees and the other half by high school students (this split is evenly distributed across the training and test sets).

### Supported Tasks and Leaderboards

- `image-classification`: The goal of this task is to classify a given image of a handwritten digit into one of 10 classes representing the integer values from 0 to 9, inclusive. The leaderboard is available [here](https://paperswithcode.com/sota/image-classification-on-mnist).

### Languages

English

## Dataset Structure

### Data Instances

A data point comprises an image and its label:

```
{
  'image': <PIL.PngImagePlugin.PngImageFile image mode=L size=28x28 at 0x276021F6DD8>,
  'label': 5
}
```

### Data Fields

- `image`: A `PIL.Image.Image` object containing the 28x28 image. Note that when the image column is accessed (e.g. `dataset[0]["image"]`), the image file is automatically decoded. Decoding a large number of image files can take a significant amount of time, so it is important to query the sample index before the `"image"` column, *i.e.* `dataset[0]["image"]` should **always** be preferred over `dataset["image"][0]`.
- `label`: An integer between 0 and 9 representing the digit.

### Data Splits

The data is split into a training set and a test set. All the images in the test set were drawn by different individuals than the images in the training set. The training set contains 60,000 images and the test set 10,000 images.

## Dataset Creation

### Curation Rationale

The MNIST database was created to provide a testbed for people wanting to try pattern recognition methods or machine learning algorithms while spending minimal effort on preprocessing and formatting. Images in the original dataset (NIST) fell into two groups, one consisting of images drawn by Census Bureau employees and one consisting of images drawn by high school students. In NIST, the training set was built by grouping all the images from the Census Bureau employees, and the test set was built by grouping the images from the high school students.
The goal in building MNIST was to have a training set and a test set that follow the same distribution, so the training set contains 30,000 images drawn by Census Bureau employees and 30,000 images drawn by high school students, and the test set contains 5,000 images from each group. The curators took care to make sure all the images in the test set were drawn by different individuals than the images in the training set.

### Source Data

#### Initial Data Collection and Normalization

The original images from NIST were size-normalized to fit a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels (i.e., pixels are not simply black or white, but have a level of greyness from 0 to 255) as a result of the anti-aliasing technique used by the normalization algorithm. The images were then centered in a 28x28 image by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field.

#### Who are the source language producers?

Half of the source images were drawn by Census Bureau employees, half by high school students. According to the dataset curator, the images from the first group are more easily recognizable.

### Annotations

#### Annotation process

No separate annotation pass was carried out after the images were created: the image creators labeled their own images with the corresponding digit after drawing them.

#### Who are the annotators?

Same as the source data creators.

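
The centering step described under Initial Data Collection and Normalization can be sketched as follows. This is a minimal illustration with NumPy, not the curators' original code: the function name `center_by_mass`, the rounding to whole pixels, and the clamping that keeps the glyph inside the canvas are all assumptions made for the example. It also assumes the glyph has already been resized to fit a 20x20 box.

```python
import numpy as np


def center_by_mass(glyph: np.ndarray, size: int = 28) -> np.ndarray:
    """Paste a small grayscale glyph onto a size x size canvas so that its
    intensity-weighted center of mass lands near the canvas center.

    Hypothetical helper for illustration; the shift is rounded to whole
    pixels and clamped so the glyph never leaves the canvas.
    """
    h, w = glyph.shape
    weights = glyph.astype(np.float64)
    total = weights.sum()
    if total == 0:
        raise ValueError("glyph contains no ink")

    # Intensity-weighted mean row and column of the glyph.
    cy = (weights.sum(axis=1) * np.arange(h)).sum() / total
    cx = (weights.sum(axis=0) * np.arange(w)).sum() / total

    # The center of a size x size pixel grid sits at (size - 1) / 2.
    target = (size - 1) / 2
    top = int(round(target - cy))
    left = int(round(target - cx))

    # Clamp the offsets so the whole glyph fits on the canvas.
    top = min(max(top, 0), size - h)
    left = min(max(left, 0), size - w)

    canvas = np.zeros((size, size), dtype=glyph.dtype)
    canvas[top:top + h, left:left + w] = glyph
    return canvas


# Example: an off-center vertical stroke inside a 20x20 box.
stroke = np.zeros((20, 20), dtype=np.uint8)
stroke[2:18, 6:10] = 200
centered = center_by_mass(stroke)
```

Because the shift is a whole number of pixels, the resulting center of mass can be off by up to half a pixel, and more if clamping applies to a strongly lopsided glyph.
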
### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

Chris Burges, Corinna Cortes and Yann LeCun

### Licensing Information

MIT Licence

### Citation Information

```
@article{lecun2010mnist,
  title={MNIST handwritten digit database},
  author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
  journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
  volume={2},
  year={2010}
}
```

### Contributions

Thanks to [@sgugger](https://github.com/sgugger) for adding this dataset.
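
As a usage note on the Data Fields advice to prefer `dataset[0]["image"]` over `dataset["image"][0]`, the toy class below shows why the order of indexing matters when images are decoded on access. `LazyImageTable` is a hypothetical stand-in written for this example, not the actual dataset library; its `_decode` method only copies bytes where a real loader would decode a PNG.

```python
class LazyImageTable:
    """Toy table that decodes images only when they are accessed."""

    def __init__(self, encoded_images, labels):
        self._encoded = list(encoded_images)
        self._labels = list(labels)
        self.decode_calls = 0  # counts how many images were decoded

    def _decode(self, blob):
        # Placeholder for real image decoding (assumption for the sketch).
        self.decode_calls += 1
        return bytes(blob)

    def __len__(self):
        return len(self._encoded)

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: decode only the requested sample.
            return {
                "image": self._decode(self._encoded[key]),
                "label": self._labels[key],
            }
        if key == "image":
            # Column access: every image in the table gets decoded.
            return [self._decode(blob) for blob in self._encoded]
        if key == "label":
            return list(self._labels)
        raise KeyError(key)


table = LazyImageTable([b"\x00" * 784] * 1000, [5] * 1000)

_ = table[0]["image"]          # row first, then column
row_first_decodes = table.decode_calls

table.decode_calls = 0
_ = table["image"][0]          # column first, then row
column_first_decodes = table.decode_calls

print(row_first_decodes, column_first_decodes)
```

In this sketch the row-first access decodes a single image, while the column-first access decodes every image in the table before the first one is picked out.
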


Provider: maas
Created: 2024-11-04
