flwrlabs/femnist

Name: flwrlabs/femnist
Creator: flwrlabs
Published: 2024-04-24 10:03:35
License: bsd-2-clause (as stated in the dataset card; the listing itself gives no description)

Source: Hugging Face. Updated 2024-04-24; indexed 2024-06-26.

Download link:

https://hf-mirror.com/datasets/flwrlabs/femnist


Description:

---
license: bsd-2-clause
dataset_info:
  features:
  - name: image
    dtype: image
  - name: writer_id
    dtype: string
  - name: hsf_id
    dtype: int64
  - name: character
    dtype:
      class_label:
        names:
          '0': '0'
          '1': '1'
          '2': '2'
          '3': '3'
          '4': '4'
          '5': '5'
          '6': '6'
          '7': '7'
          '8': '8'
          '9': '9'
          '10': A
          '11': B
          '12': C
          '13': D
          '14': E
          '15': F
          '16': G
          '17': H
          '18': I
          '19': J
          '20': K
          '21': L
          '22': M
          '23': 'N'
          '24': O
          '25': P
          '26': Q
          '27': R
          '28': S
          '29': T
          '30': U
          '31': V
          '32': W
          '33': X
          '34': 'Y'
          '35': Z
          '36': a
          '37': b
          '38': c
          '39': d
          '40': e
          '41': f
          '42': g
          '43': h
          '44': i
          '45': j
          '46': k
          '47': l
          '48': m
          '49': 'n'
          '50': o
          '51': p
          '52': q
          '53': r
          '54': s
          '55': t
          '56': u
          '57': v
          '58': w
          '59': x
          '60': 'y'
          '61': z
  splits:
  - name: train
    num_bytes: 206539811.49
    num_examples: 814277
  download_size: 200734290
  dataset_size: 206539811.49
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- image-classification
size_categories:
- 100K<n<1M
---

# Dataset Card for FEMNIST

The FEMNIST dataset is part of the [LEAF](https://leaf.cmu.edu/) benchmark. It is an image classification dataset of handwritten digits and lowercase and uppercase letters, giving 62 unique labels.

## Dataset Details

### Dataset Description

Each sample comprises a 28x28 grayscale image, writer_id, hsf_id, and character.

- **Curated by:** [LEAF](https://leaf.cmu.edu/)
- **License:** BSD 2-Clause License

### Dataset Sources

FEMNIST is a preprocessed version of [NIST SD 19](https://www.nist.gov/srd/nist-special-database-19); the preprocessing resembles the one used for MNIST.

## Uses

This dataset is intended for use in federated learning settings.

### Direct Use

We recommend using [Flower Datasets](https://flower.ai/docs/datasets/) (flwr-datasets) and [Flower](https://flower.ai/docs/framework/) (flwr).

To partition the dataset, do the following.

1. Install the package.

```bash
pip install "flwr-datasets[vision]"
```

2. Use the HF Dataset under the hood in Flower Datasets.

```python
from flwr_datasets import FederatedDataset
from flwr_datasets.partitioner import NaturalIdPartitioner

# Partition the train split so that each writer_id forms its own partition.
fds = FederatedDataset(
    dataset="flwrlabs/femnist",
    partitioners={"train": NaturalIdPartitioner(partition_by="writer_id")}
)
# Load the data of a single partition (here, the first writer).
partition = fds.load_partition(partition_id=0)
```

## Dataset Structure

The whole dataset is kept in the train split. If you want to leave out some part of the dataset for centralized evaluation, use Resplitter. (The full example is coming soon here)

Dataset fields:
* image: grayscale image of size (28, 28), PIL Image,
* writer_id: string, unique value for each writer,
* hsf_id: int64 (per the metadata above), corresponds to the way the data was collected (see more details [here](https://www.nist.gov/srd/nist-special-database-19)),
* character: ClassLabel (it is an int when you access it in the dataset, but you can convert it to the original value with `femnist["train"].features["character"].int2str(value)`).

## Dataset Creation

### Curation Rationale

This dataset was created as part of the [LEAF](https://leaf.cmu.edu/) benchmark. We make it available on the Hugging Face Hub to facilitate its seamless use in Flower Datasets.

### Source Data

[NIST SD 19](https://www.nist.gov/srd/nist-special-database-19)

#### Data Collection and Processing

For the preprocessing details, please refer to the original paper, the source code and [NIST SD 19](https://www.nist.gov/srd/nist-special-database-19).

#### Who are the source data producers?

For details, please refer to the original paper, the source code and [NIST SD 19](https://www.nist.gov/srd/nist-special-database-19).

## Citation

When working on the LEAF benchmark, please cite the original paper. If you're using this dataset with Flower Datasets, you can cite Flower.

**BibTeX:**

```
@article{DBLP:journals/corr/abs-1812-01097,
  author     = {Sebastian Caldas and Peter Wu and Tian Li and Jakub Kone{\v{c}}n{\'y} and H. Brendan McMahan and Virginia Smith and Ameet Talwalkar},
  title      = {{LEAF:} {A} Benchmark for Federated Settings},
  journal    = {CoRR},
  volume     = {abs/1812.01097},
  year       = {2018},
  url        = {http://arxiv.org/abs/1812.01097},
  eprinttype = {arXiv},
  eprint     = {1812.01097},
  timestamp  = {Wed, 23 Dec 2020 09:35:18 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-1812-01097.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```

```
@article{DBLP:journals/corr/abs-2007-14390,
  author     = {Daniel J. Beutel and Taner Topal and Akhil Mathur and Xinchi Qiu and Titouan Parcollet and Nicholas D. Lane},
  title      = {Flower: {A} Friendly Federated Learning Research Framework},
  journal    = {CoRR},
  volume     = {abs/2007.14390},
  year       = {2020},
  url        = {https://arxiv.org/abs/2007.14390},
  eprinttype = {arXiv},
  eprint     = {2007.14390},
  timestamp  = {Mon, 03 Aug 2020 14:32:13 +0200},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2007-14390.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```

## Dataset Card Contact

In case of any doubts, please contact [Flower Labs](https://flower.ai/).


Provider:

flwrlabs

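Summary of the original information

Dataset card: FEMNIST

Dataset description

Features

image: the image; data type image.
writer_id: string; unique identifier of the writer.
hsf_id: integer (int64); identifier related to how the data was collected.
character: class label with 62 classes, ordered as digits 0-9, uppercase A-Z, then lowercase a-z.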
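
The label ordering described above can be reproduced without downloading the dataset. The following is a minimal sketch using only the Python standard library; the names `FEMNIST_CHARACTERS`, `index_to_character`, and `character_to_index` are illustrative and are not part of any library.

```python
import string

# Label order as listed in the dataset card's class_label names:
# indices 0-9 are digits, 10-35 uppercase letters, 36-61 lowercase letters.
FEMNIST_CHARACTERS = string.digits + string.ascii_uppercase + string.ascii_lowercase


def index_to_character(index: int) -> str:
    """Map a class index (0..61) to its character."""
    if not 0 <= index < len(FEMNIST_CHARACTERS):
        raise ValueError(f"index must be in [0, {len(FEMNIST_CHARACTERS) - 1}]")
    return FEMNIST_CHARACTERS[index]


def character_to_index(character: str) -> int:
    """Map a single character back to its class index."""
    position = FEMNIST_CHARACTERS.find(character)
    if len(character) != 1 or position < 0:
        raise ValueError(f"not a FEMNIST character: {character!r}")
    return position
```

When the dataset is loaded through the Hugging Face `datasets` library, the card's own `int2str` call on the `character` feature serves the same purpose; this sketch is only a standalone illustration of the mapping.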

Splits

train: training split with 814,277 examples, 206,539,811.49 bytes.

Dataset size

Download size: 200,734,290 bytes
Dataset size: 206,539,811.49 bytes
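
To put these figures in perspective, the short calculation below derives the average stored size per example and converts the totals to MiB. It is a sketch using only the numbers listed above; the variable names are illustrative.

```python
# Figures copied from the listing above.
NUM_EXAMPLES = 814_277
DATASET_SIZE_BYTES = 206_539_811.49
DOWNLOAD_SIZE_BYTES = 200_734_290

BYTES_PER_MIB = 2**20

# Average storage per example (image plus the three metadata fields).
bytes_per_example = DATASET_SIZE_BYTES / NUM_EXAMPLES

dataset_mib = DATASET_SIZE_BYTES / BYTES_PER_MIB
download_mib = DOWNLOAD_SIZE_BYTES / BYTES_PER_MIB

print(f"average bytes per example: {bytes_per_example:.1f}")
print(f"dataset size: {dataset_mib:.1f} MiB, download size: {download_mib:.1f} MiB")
```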

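Configurations

default: contains the training data files, at path data/train-*.

Task category

Image classification

Size category

100K<n<1M

Dataset structure

The dataset contains only a train split. Each sample has the following fields:

image: 28x28 grayscale image, PIL Image format.
writer_id: string; unique identifier of each writer.
hsf_id: integer (int64); corresponds to how the data was collected.
character: class label; can be converted to the original value with femnist["train"].features["character"].int2str(value).

Dataset creation

Source data

NIST SD 19: the original source of the dataset.

Data collection and processing

For preprocessing details, refer to the original paper and NIST SD 19.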

Citation

BibTeX

@article{DBLP:journals/corr/abs-1812-01097, author = {Sebastian Caldas and Peter Wu and Tian Li and Jakub Kone{\v{c}}n{\'y} and H. Brendan McMahan and Virginia Smith and Ameet Talwalkar}, title = {{LEAF:} {A} Benchmark for Federated Settings}, journal = {CoRR}, volume = {abs/1812.01097}, year = {2018}, url = {http://arxiv.org/abs/1812.01097}, eprinttype = {arXiv}, eprint = {1812.01097}, timestamp = {Wed, 23 Dec 2020 09:35:18 +0100}, biburl = {https://dblp.org/rec/journals/corr/abs-1812-01097.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }

@article{DBLP:journals/corr/abs-2007-14390, author = {Daniel J. Beutel and Taner Topal and Akhil Mathur and Xinchi Qiu and Titouan Parcollet and Nicholas D. Lane}, title = {Flower: {A} Friendly Federated Learning Research Framework}, journal = {CoRR}, volume = {abs/2007.14390}, year = {2020}, url = {https://arxiv.org/abs/2007.14390}, eprinttype = {arXiv}, eprint = {2007.14390}, timestamp = {Mon, 03 Aug 2020 14:32:13 +0200}, biburl = {https://dblp.org/rec/journals/corr/abs-2007-14390.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} }

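Collected summary

Dataset introduction

Construction

FEMNIST was built as part of the LEAF benchmark. It is an image classification dataset of handwritten digits and uppercase and lowercase letters, with 62 unique labels. It is a preprocessed version of NIST SD 19, and each sample consists of a 28x28 grayscale image, a writer ID, an HSF ID, and a character label. The preprocessing resembles the one used for MNIST, and the dataset is intended for federated learning settings.

Characteristics

FEMNIST is suited to federated learning because it can be partitioned naturally by writer ID, which helps simulate learning across distributed devices. With 62 classes covering digits and both letter cases, it is also a useful benchmark for image classification. Each sample carries a writer ID and an HSF ID in addition to the image and label, and these fields can support further analysis.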
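
The idea of a natural partition by writer ID can be sketched with plain Python. The snippet below is an illustration only: `partition_by_natural_id` and the toy records are hypothetical, and the dataset card itself recommends `NaturalIdPartitioner` from Flower Datasets for real use.

```python
from collections import defaultdict


def partition_by_natural_id(records, key="writer_id"):
    """Group records so that each distinct value of `key` forms one partition."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record[key]].append(record)
    return dict(partitions)


# Hypothetical toy records; real samples also carry an image and an hsf_id.
toy_records = [
    {"writer_id": "writer_a", "character": 10},
    {"writer_id": "writer_b", "character": 3},
    {"writer_id": "writer_a", "character": 47},
]

partitions = partition_by_natural_id(toy_records)
for writer, samples in partitions.items():
    print(writer, len(samples))
```

Each partition then plays the role of one simulated client, so the number of clients equals the number of distinct writers.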

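Usage

The recommended way to use FEMNIST is through Flower Datasets and the Flower framework. Install the Flower Datasets package, then load and partition the dataset with it. Images are provided as PIL images, and the integer labels can be converted back to the original characters with the feature's conversion function. To hold out part of the dataset for centralized evaluation, use the Resplitter tool.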
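
Holding out data for centralized evaluation amounts to setting aside a fixed subset before partitioning the rest. The sketch below shows the idea with a seeded random split over example indices. It is not the Resplitter API: `hold_out_for_central_eval`, its parameters, and the toy size of 1000 are assumptions made for illustration.

```python
import random


def hold_out_for_central_eval(num_examples, eval_fraction=0.1, seed=42):
    """Return (train_indices, eval_indices) as disjoint lists of example indices."""
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be strictly between 0 and 1")
    indices = list(range(num_examples))
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    num_eval = int(num_examples * eval_fraction)
    return indices[num_eval:], indices[:num_eval]


# Toy example: 1000 examples, a quarter reserved for centralized evaluation.
train_indices, eval_indices = hold_out_for_central_eval(1000, eval_fraction=0.25)
print(len(train_indices), len(eval_indices))
```

This split is per example. Depending on the experiment, one might instead hold out whole writers so that evaluation measures generalization to unseen writers.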

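Background and challenges

Background overview

FEMNIST, part of the LEAF benchmark, targets image classification of handwritten digits and uppercase and lowercase letters, with 62 unique labels. According to this summary, it was compiled by the LEAF project team at Carnegie Mellon University and released in 2018 as a resource for federated learning. The core research question is how nodes in a distributed network can each train on local data only while still improving a shared global model. The dataset was created to give federated learning research a common benchmark, and it has been used in both academia and industry.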
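
One common way to combine locally trained models into a global one is a weighted average of their parameters, with weights proportional to each node's number of samples (often called federated averaging). The source does not name a specific algorithm, so the sketch below is an assumption chosen for illustration; `weighted_average` and the toy values are hypothetical.

```python
def weighted_average(client_parameters, client_sizes):
    """Average parameter vectors, weighting each client by its sample count."""
    if len(client_parameters) != len(client_sizes):
        raise ValueError("need exactly one size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    num_parameters = len(client_parameters[0])
    return [
        sum(params[i] * size for params, size in zip(client_parameters, client_sizes)) / total
        for i in range(num_parameters)
    ]


# Two toy clients with two parameters each; the second holds three times more data.
global_parameters = weighted_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_parameters)
```

In a FEMNIST experiment, each client would correspond to one writer, and the sizes would be the number of samples that writer contributed.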

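Current challenges

The challenges around FEMNIST fall into two groups. The first concerns the problem domain: how to train and evaluate models efficiently while protecting data privacy. The second concerns construction: data preprocessing, label consistency checks, and the uneven distribution of data across writers. Together they require federated learning algorithms to balance effectiveness and accuracy against data privacy and communication efficiency.

Common scenarios

Classic use cases

In image classification, FEMNIST pairs writer identity with images of handwritten digits and letters, so it can be used to study writer identification together with character recognition. It is particularly suited to federated learning, where models are trained and optimized across distributed devices while the data stays private.

Academic problems addressed

FEMNIST supports research on data privacy protection and model generalization in federated learning. Researchers can train models through a federated learning framework while keeping data local, which makes the dataset useful for studying how to balance privacy and learning efficiency in distributed networks.

Derived related work

Work building on FEMNIST includes improving recognition accuracy for handwritten digits and letters, refining federated learning algorithms, and studying data privacy protection measures. This research has helped move federated learning toward real-world applications.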

The content above was collected and summarized by 遇见数据集.
