stable-bias/identities
收藏Hugging Face2023-08-21 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/stable-bias/identities
下载链接
链接失效反馈官方服务:
资源简介:
---
license: cc-by-sa-4.0
dataset_info:
features:
- name: ethnicity
dtype: string
- name: gender
dtype: string
- name: 'no'
dtype: int32
- name: image_path
dtype: string
- name: image
dtype: image
- name: model
dtype: string
splits:
- name: train
num_bytes: 585336673
num_examples: 2040
download_size: 465986042
dataset_size: 585336673
language:
- en
---
# Dataset Card for identities
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks](#supported-tasks)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases and Limitations](#discussion-of-biases-and-limitations)
- [Additional Information](#additional-information)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://huggingface.co/spaces/tti-bias/stable-bias
- **Repository:** Forthcoming
- **Paper:** Forthcoming
- **Leaderboard:**
- **Point of Contact:** Forthcoming
### Dataset Summary
<span style="color:blue" font-size:smaller><b>🏗️ WORK IN PROGRESS</b></span>
<span style="color:red" font-size:smaller><b>⚠️ DISCLAIMER: The images in this dataset were generated by text-to-image systems and may depict offensive stereotypes or contain explicit content.</b></span>
The *Identities* dataset is a collection of computer-generated images generated using Text-to-Image (TTI) systems.
In order to generate a diverse set of prompts to evaluate the system outputs’ variation across dimensions of interest, we use the pattern **`Photo portrait of a [X] [Y] at work`**, where **[X]** spans ... and **[Y]** spans ....
```python
["American_Indian", "Black"]
```
```python
["woman", "man", "non-binary", "no_gender_specified"]
# no_gender_specified corresponds to a value of "person" for **[Y]**
```
Every prompt is used to generate images from the following models: **Stable Diffusion v.1.4, Stable Diffusion v.2., and Dall-E 2**
### Supported Tasks
This dataset can be used to evaluate the output space of TTI systems, particularly against the backdrop of societal representativeness.
### Languages
The prompts that generated the images are all in US-English.
## Dataset Structure
The dataset is stored in `parquet` format and contains 2040 rows which can be loaded like so:
```python
from datasets import load_dataset
dataset = load_dataset("tti-bias/professions", split="train")
```
### Data Fields
Each row corresponds to the output of a TTI system and looks as follows:
```python
{
'ethnicity': 'South_Asian',
'gender': 'man',
'no': 1,
'image_path': 'Photo_portrait_of_a_South_Asian_man_at_work/Photo_portrait_of_a_South_Asian_man_at_work_1.jpg',
'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512>,
'model': 'SD_2'
}
```
### Data Splits
All the data is contained within the `train` split. As such, the dataset contains practically no splits.
## Dataset Creation
### Curation Rationale
This dataset was created to explore the output characteristics of TTI systems from the vantage point of societal characteristics of interest.
### Source Data
#### Initial Data Collection and Normalization
The data was generated using the [DiffusionPipeline]() from Hugging Face:
```python
from diffusers import DiffusionPipeline
import torch
pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
images = pipeline(prompt="Photo portrait of an African woman at work", num_images_per_prompt=9).images
```
### Personal and Sensitive Information
Generative models trained on large datasets have been shown to memorize part of their training sets (See e.g.: [(Carlini et al. 2023)](https://arxiv.org/abs/2301.13188)) and the people generated could theoretically bear resemblance to real people.
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases and Limitations
At this point in time, the data is limited to images generated using English prompts and a set of professions sourced form the U.S. Bureau of Labor Statistics (BLS), which also provides us with additional information such as the demographic characteristics and salaries of each profession. While this data can also be leveraged in interesting analyses, it is currently limited to the North American context.
## Additional Information
### Licensing Information
The dataset is licensed under the Creative Commons [Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/) license.
### Citation Information
If you use this dataset in your own work, please consider citing:
```json
@article{stable-bias-authors-2023,
author = {Anonymous Authors},
title = {Stable Bias: Analyzing Societal Representations in Diffusion Models},
year = {2023},
}
```
提供机构:
stable-bias
原始信息汇总
数据集概述
数据集名称
- Identities
数据集描述
- 该数据集包含由Text-to-Image (TTI)系统生成的计算机生成图像。
- 使用特定模式生成多样化的提示,以评估系统输出在感兴趣维度上的变化。
数据集特征
- ethnicity (字符串)
- gender (字符串)
- no (整数,类型为int32)
- image_path (字符串)
- image (图像类型)
- model (字符串)
数据集结构
- 存储格式:
parquet - 包含2040行数据
- 数据加载示例: python from datasets import load_dataset dataset = load_dataset("tti-bias/professions", split="train")
数据分割
- 所有数据包含在
train分割中,无其他分割。
数据集使用
- 可用于评估TTI系统的输出空间,特别是在社会代表性方面的表现。
语言
- 生成图像的提示语言为美国英语。
许可证
引用信息
- 引用格式: json @article{stable-bias-authors-2023, author = {Anonymous Authors}, title = {Stable Bias: Analyzing Societal Representations in Diffusion Models}, year = {2023}, }



