common-canvas/commoncatalog-cc-by-sa
Source: Hugging Face. Updated 2024-05-16; indexed 2024-05-25.
Download link:
https://hf-mirror.com/datasets/common-canvas/commoncatalog-cc-by-sa
Resource description:
---
license: cc-by-sa-4.0
dataset_info:
  features:
  - name: jpg
    dtype: image
  - name: blip2_caption
    dtype: string
  - name: caption
    dtype: string
  - name: licensename
    dtype: string
  - name: licenseurl
    dtype: string
  - name: width
    dtype: int32
  - name: height
    dtype: int32
  - name: original_width
    dtype: int32
  - name: original_height
    dtype: int32
  - name: photoid
    dtype: int64
  - name: uid
    dtype: string
  - name: unickname
    dtype: string
  - name: datetaken
    dtype: timestamp[us]
  - name: dateuploaded
    dtype: int64
  - name: capturedevice
    dtype: string
  - name: title
    dtype: string
  - name: usertags
    dtype: string
  - name: machinetags
    dtype: string
  - name: longitude
    dtype: float64
  - name: latitude
    dtype: float64
  - name: accuracy
    dtype: int64
  - name: pageurl
    dtype: string
  - name: downloadurl
    dtype: string
  - name: serverid
    dtype: int64
  - name: farmid
    dtype: int64
  - name: secret
    dtype: string
  - name: secretoriginal
    dtype: string
  - name: ext
    dtype: string
  - name: url
    dtype: string
  - name: key
    dtype: string
  - name: status
    dtype: string
  - name: error_message
    dtype: string
  - name: exif
    dtype: string
  - name: sha256
    dtype: string
  - name: description
    dtype: string
task_categories:
- text-to-image
language:
- en
---
# Dataset Card for CommonCatalog CC-BY-SA
This dataset is a large collection of high-resolution Creative Commons images (covering several different licenses; see Table 1 in the appendix of the paper) collected in 2014 from users of Yahoo Flickr.
The dataset contains images of up to 4K resolution, making it one of the highest-resolution captioned image datasets.
## Dataset Details
### Dataset Description
We provide synthetic captions for approximately 100 million high-resolution images collected from Yahoo Flickr Creative Commons (YFCC).
- **Curated by:** Aaron Gokaslan
- **Language(s) (NLP):** en
- **License:** See relevant yaml tag / dataset name.
### Dataset Sources
- **Repository:** https://github.com/mosaicml/diffusion
- **Paper:** https://arxiv.org/abs/2310.16825
- **Demo:** See the CommonCanvas Gradio demos
## Uses
We use CommonCatalog to train a family of latent diffusion models called CommonCanvas.
The goal is to produce a model that is competitive with Stable Diffusion 2, but to do so using an easily accessible dataset of known provenance.
Doing so makes replicating the model significantly easier, and provides a clearer mechanism for applying training-data attribution techniques.
### Direct Use
* Training text-to-image models
* Training image-to-text models
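
As a minimal sketch of the text-to-image use, the helper below turns one record into an (image, text) training pair. It relies only on the field names listed in the card's `features` (`jpg`, `blip2_caption`, `caption`). Preferring `blip2_caption` and falling back to `caption`, the function names, and the `train` split name are illustrative assumptions, not the authors' training pipeline; loading may also require a subset (config) name.

```python
from typing import Any, Iterator, Optional, Tuple


def to_training_pair(record: dict) -> Optional[Tuple[Any, str]]:
    """Return (image, caption) for one record, or None if it is unusable.

    Assumption: prefer the synthetic `blip2_caption`, fall back to `caption`.
    """
    image = record.get("jpg")
    if image is None:
        return None
    for field in ("blip2_caption", "caption"):
        text = record.get(field)
        if isinstance(text, str) and text.strip():
            return image, text.strip()
    return None


def stream_pairs(repo_id: str = "common-canvas/commoncatalog-cc-by-sa",
                 split: str = "train") -> Iterator[Tuple[Any, str]]:
    """Lazily yield training pairs; needs the `datasets` package and network access.

    The split name "train" is an assumption; check the repository for the real splits.
    """
    # Imported lazily so that to_training_pair has no third-party dependency.
    from datasets import load_dataset

    for record in load_dataset(repo_id, split=split, streaming=True):
        pair = to_training_pair(record)
        if pair is not None:
            yield pair
```

For the image-to-text direction, the same pair can be consumed with the caption as the target instead of the conditioning input.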
### Out-of-Scope Use
* Crafting content that is offensive or injurious towards individuals, including negative portrayals of their living conditions, cultural backgrounds, religious beliefs, etc.
* Deliberately creating or spreading content that is discriminatory or reinforces harmful stereotypes.
* Falsely representing individuals without their permission.
* Generating sexual content that may be seen by individuals without their consent.
* Producing or disseminating false or misleading information.
* Creating content that depicts extreme violence or bloodshed.
* Distributing content that modifies copyrighted or licensed material in a way that breaches its usage terms.
## Dataset Structure
The dataset is divided into 10 subsets, each made up of Parquet files of about 4 GB each. Each subfolder within a subset covers one resolution range of the images and their respective aspect ratios.
The dataset is also split into images licensed for commercial use (C) and images that are not (NC).
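
The commercial (C) versus non-commercial (NC) distinction can be illustrated from the `licenseurl` field. The sketch below assumes that field holds standard Creative Commons URLs of the form `https://creativecommons.org/licenses/<code>/<version>/`; the function names are hypothetical, and the card does not say that the authors derived the split this way.

```python
from urllib.parse import urlparse


def license_code(license_url: str) -> str:
    """Extract the license code (e.g. "by-sa") from a Creative Commons URL.

    Assumes a path shaped like /licenses/<code>/<version>/.
    """
    parts = [p for p in urlparse(license_url).path.split("/") if p]
    if len(parts) < 2 or parts[0] != "licenses":
        raise ValueError(f"not a recognised Creative Commons license URL: {license_url!r}")
    return parts[1].lower()


def allows_commercial_use(license_url: str) -> bool:
    """True when the license code has no "nc" (NonCommercial) element."""
    return "nc" not in license_code(license_url).split("-")
```

Splitting the code on `-` rather than searching for the substring `nc` avoids false matches inside other tokens.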
## Dataset Creation
### Curation Rationale
To create a standardized, accessible dataset with synthetic captions and release it so that others can train on a common dataset for open-source image generation.
### Source Data
Yahoo Flickr Creative Commons 100M Dataset and Synthetically Generated Caption Data.
#### Data Collection and Processing
All synthetic captions were generated with BLIP2. See paper for more details.
#### Who are the source data producers?
Users of Flickr
## Bias, Risks, and Limitations
See the Yahoo Flickr Creative Commons 100M dataset for more information. The data was collected circa 2014 and is known to be biased towards internet-connected Western countries. Some regions, such as the Global South, are underrepresented.
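
One rough way to inspect this geographic skew is to count records per coarse latitude/longitude cell using the `latitude` and `longitude` fields. The sketch below is only an illustration: the 10-degree cell size and the function names are assumptions, and records without usable coordinates are tallied under `None` rather than dropped, since missing geotags may be common.

```python
import math
from collections import Counter
from typing import Iterable, Optional, Tuple


def grid_cell(latitude: Optional[float], longitude: Optional[float],
              cell_degrees: float = 10.0) -> Optional[Tuple[int, int]]:
    """Map a coordinate to the south-west corner of its grid cell, or None if missing or invalid."""
    if latitude is None or longitude is None:
        return None
    if math.isnan(latitude) or math.isnan(longitude):
        return None
    if not (-90.0 <= latitude <= 90.0 and -180.0 <= longitude <= 180.0):
        return None
    lat0 = int(math.floor(latitude / cell_degrees) * cell_degrees)
    lon0 = int(math.floor(longitude / cell_degrees) * cell_degrees)
    return lat0, lon0


def coverage_by_cell(records: Iterable[dict], cell_degrees: float = 10.0) -> Counter:
    """Count records per grid cell; records without usable coordinates go under None."""
    counts: Counter = Counter()
    for record in records:
        cell = grid_cell(record.get("latitude"), record.get("longitude"), cell_degrees)
        counts[cell] += 1
    return counts
```

Sorting the resulting counter by count gives a quick view of which regions dominate a sample and which are nearly absent.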
## Citation
**BibTeX:**
```
@article{gokaslan2023commoncanvas,
title={CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images},
author={Gokaslan, Aaron and Cooper, A Feder and Collins, Jasmine and Seguin, Landan and Jacobson, Austin and Patel, Mihir and Frankle, Jonathan and Stephenson, Cory and Kuleshov, Volodymyr},
journal={arXiv preprint arXiv:2310.16825},
year={2023}
}
```
## Dataset Card Authors
[Aaron Gokaslan](https://huggingface.co/Skylion007)
## Dataset Card Contact
[Aaron Gokaslan](https://huggingface.co/Skylion007)
Provider:
common-canvas
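## Summary of Original Information
### Dataset Overview
#### Dataset Name
CommonCatalog CC-BY-SA
#### Dataset Description
This is a large dataset of high-resolution Creative Commons images, collected in 2014 from users of Yahoo Flickr. The dataset contains images of up to 4K resolution, making it one of the highest-resolution captioned image datasets.
#### Dataset Features
- Image resolution: up to 4K
- Number of images: approximately 100 million
- Image source: Yahoo Flickr Creative Commons
- Additional information: each image comes with a synthetic caption
#### Dataset Contents
- Image format: jpg
- Text fields: blip2_caption, caption, licensename, licenseurl, title, usertags, machinetags, description
- Metadata: width, height, original_width, original_height, photoid, uid, unickname, datetaken, dateuploaded, capturedevice, longitude, latitude, accuracy, pageurl, downloadurl, serverid, farmid, secret, secretoriginal, ext, url, key, status, error_message, exif, sha256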
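### Dataset Uses
- Direct use: training text-to-image models and training image-to-text models
- Out-of-scope use: the dataset must not be used to create or spread harmful or discriminatory content, to represent individuals without authorization, to generate sexual content without consent, to spread false or misleading information, to create content depicting extreme violence or gore, or to modify copyrighted or licensed material in violation of its terms of use.
### Dataset Structure
The dataset is divided into 10 parts, each containing Parquet files of about 4 GB. Each subfolder contains images of a given resolution range and their corresponding aspect ratios. The dataset is also split by commercial-use license (C) and non-commercial-use license (NC).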
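### Dataset Creation
- Source data: Yahoo Flickr Creative Commons 100M Dataset and synthetically generated caption data
- Data processing: all synthetic captions were generated with BLIP2
- Data producers: users of Flickr
### License
CC-BY-SA-4.0
### Language
English (en)
### Dataset Authors
Aaron Gokaslan
### Dataset Contact
Aaron Gokaslan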



