common-canvas/commoncatalog-cc-by-sa

Name: common-canvas/commoncatalog-cc-by-sa
Creator: common-canvas
Published: 2024-05-16 19:41:37
License: No description available

Hugging Face, updated 2024-05-16, indexed 2024-05-25

Download link:

https://hf-mirror.com/datasets/common-canvas/commoncatalog-cc-by-sa

Resource description:

---
license: cc-by-sa-4.0
dataset_info:
  features:
  - name: jpg
    dtype: image
  - name: blip2_caption
    dtype: string
  - name: caption
    dtype: string
  - name: licensename
    dtype: string
  - name: licenseurl
    dtype: string
  - name: width
    dtype: int32
  - name: height
    dtype: int32
  - name: original_width
    dtype: int32
  - name: original_height
    dtype: int32
  - name: photoid
    dtype: int64
  - name: uid
    dtype: string
  - name: unickname
    dtype: string
  - name: datetaken
    dtype: timestamp[us]
  - name: dateuploaded
    dtype: int64
  - name: capturedevice
    dtype: string
  - name: title
    dtype: string
  - name: usertags
    dtype: string
  - name: machinetags
    dtype: string
  - name: longitude
    dtype: float64
  - name: latitude
    dtype: float64
  - name: accuracy
    dtype: int64
  - name: pageurl
    dtype: string
  - name: downloadurl
    dtype: string
  - name: serverid
    dtype: int64
  - name: farmid
    dtype: int64
  - name: secret
    dtype: string
  - name: secretoriginal
    dtype: string
  - name: ext
    dtype: string
  - name: url
    dtype: string
  - name: key
    dtype: string
  - name: status
    dtype: string
  - name: error_message
    dtype: string
  - name: exif
    dtype: string
  - name: sha256
    dtype: string
  - name: description
    dtype: string
task_categories:
- text-to-image
language:
- en
---

# Dataset Card for CommonCatalog CC-BY-SA

This dataset is a large collection of high-resolution Creative Commons images (composed of different licenses, see Table 1 in the Appendix of the paper) collected in 2014 from users of Yahoo Flickr. The dataset contains images of up to 4K resolution, making it one of the highest-resolution captioned image datasets.

## Dataset Details

### Dataset Description

We provide synthetic captions for approximately 100 million high-resolution images collected from Yahoo Flickr Creative Commons (YFCC).

- **Curated by:** Aaron Gokaslan
- **Language(s) (NLP):** en
- **License:** See relevant yaml tag / dataset name.

### Dataset Sources

- **Repository:** https://github.com/mosaicml/diffusion
- **Paper:** https://arxiv.org/abs/2310.16825
- **Demo:** See CommonCanvas Gradios

## Uses

We use CommonCatalog to train a family of latent diffusion models called CommonCanvas. The goal is to produce a model that is competitive with Stable Diffusion 2, but to do so using an easily accessible dataset of known provenance. Doing so makes replicating the model significantly easier and provides a clearer mechanism for applying training-data attribution techniques.

### Direct Use

* Training text-to-image models
* Training image-to-text models

### Out-of-Scope Use

* Crafting content that is offensive or injurious towards individuals, including negative portrayals of their living conditions, cultural backgrounds, religious beliefs, etc.
* Deliberately creating or spreading content that is discriminatory or reinforces harmful stereotypes.
* Falsely representing individuals without their permission.
* Generating sexual content that may be seen by individuals without their consent.
* Producing or disseminating false or misleading information.
* Creating content that depicts extreme violence or bloodshed.
* Distributing content that modifies copyrighted or licensed material in a way that breaches its usage terms.

## Dataset Structure

The dataset is divided into 10 subsets, each containing Parquet files of about 4 GB each. Each subfolder within contains a resolution range of the images and their respective aspect ratios. The dataset is also divided into images licensed for commercial use (C) and those that are not (NC).

## Dataset Creation

### Curation Rationale

To create a standardized, accessible dataset with synthetic captions and release it so that others can train on a common dataset for open-source image generation.

### Source Data

Yahoo Flickr Creative Commons 100M Dataset and synthetically generated caption data.

#### Data Collection and Processing

All synthetic captions were generated with BLIP2. See paper for more details.

#### Who are the source data producers?

Users of Flickr

## Bias, Risks, and Limitations

See the Yahoo Flickr Creative Commons 100M dataset for more information. The data was collected circa 2014 and is known to be biased toward internet-connected Western countries. Some regions, such as the Global South, are underrepresented.

## Citation

**BibTeX:**

```
@article{gokaslan2023commoncanvas,
  title={CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images},
  author={Gokaslan, Aaron and Cooper, A Feder and Collins, Jasmine and Seguin, Landan and Jacobson, Austin and Patel, Mihir and Frankle, Jonathan and Stephenson, Cory and Kuleshov, Volodymyr},
  journal={arXiv preprint arXiv:2310.16825},
  year={2023}
}
```

## Dataset Card Authors

[Aaron Gokaslan](https://huggingface.co/Skylion007)

## Dataset Card Contact

[Aaron Gokaslan](https://huggingface.co/Skylion007)

Provided by:

common-canvas

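Summary of Original Information

Dataset Overview

Dataset Name

CommonCatalog CC-BY-SA

Dataset Description

This is a large dataset of high-resolution Creative Commons images, collected in 2014 from Yahoo Flickr users. It contains images of up to 4K resolution, making it one of the highest-resolution captioned image datasets.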

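Dataset Features

Image resolution: up to 4K
Number of images: approximately 100 million
Image source: Yahoo Flickr Creative Commons
Additional information: each image comes with a synthetic caption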

Dataset Contents

Image format: jpg
Text fields: blip2_caption, caption, licensename, licenseurl, title, usertags, machinetags, description
Metadata: width, height, original_width, original_height, photoid, uid, unickname, datetaken, dateuploaded, capturedevice, longitude, latitude, accuracy, pageurl, downloadurl, serverid, farmid, secret, secretoriginal, ext, url, key, status, error_message, exif, sha256
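Because the images are released under a CC-BY-SA license, downstream users will typically need to credit each photo. The fields listed above suggest how a credit line might be assembled. The sketch below is an illustration only: it assumes that `unickname` holds the Flickr user's display name and that `pageurl` points at the photo page, neither of which the source spells out. The function `build_attribution` and the sample record are hypothetical.

```python
from typing import Any, Mapping


def build_attribution(record: Mapping[str, Any]) -> str:
    """Build a one-line credit string from a CommonCatalog-style record.

    Assumption: `unickname` is the Flickr display name and `pageurl` is the
    photo page. Missing or empty fields fall back to neutral placeholders.
    """
    title = record.get("title") or "Untitled"
    author = record.get("unickname") or "unknown Flickr user"
    license_name = record.get("licensename") or "unknown license"
    parts = [f'"{title}" by {author}', f"licensed under {license_name}"]
    if record.get("licenseurl"):
        # Append the license URL to the license clause when it is present.
        parts[-1] += f" ({record['licenseurl']})"
    if record.get("pageurl"):
        parts.append(f"source: {record['pageurl']}")
    return ", ".join(parts)


# Hypothetical record for illustration; real rows would come from the
# dataset itself, for example via the Hugging Face `datasets` library.
sample = {
    "title": "Harbor at dusk",
    "unickname": "example_user",
    "licensename": "Attribution-ShareAlike License",
    "licenseurl": "https://creativecommons.org/licenses/by-sa/2.0/",
    "pageurl": "https://www.flickr.com/photos/example_user/123/",
}
print(build_attribution(sample))
```

Using `record.get(...) or default` treats empty strings the same as missing keys, which seems a reasonable choice for scraped metadata where either can occur.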

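Dataset Uses

Direct use: training text-to-image models and image-to-text models
Out-of-scope use: the dataset should not be used to create or spread harmful or discriminatory content, to represent individuals without their authorization, to generate sexual content without consent, to spread false or misleading information, to create content depicting extreme violence or gore, or to modify copyrighted or licensed material in ways that violate its terms of use.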

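Dataset Structure

The dataset is divided into 10 parts, each containing Parquet files of about 4 GB. Each subfolder holds images in a particular resolution range together with their corresponding aspect ratios. The dataset is also split by license into images licensed for commercial use (C) and images licensed for non-commercial use only (NC).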
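As a rough illustration of grouping rows by resolution range and aspect ratio, the sketch below derives both from the `width` and `height` fields. The bucket edges (512, 1024, 2048 pixels) and the function names are hypothetical; the source does not state the actual ranges or folder names used in the repository.

```python
from math import gcd
from typing import Tuple

# Hypothetical bucket edges on the longer image side, in pixels.
BUCKET_EDGES = (512, 1024, 2048)


def aspect_ratio(width: int, height: int) -> Tuple[int, int]:
    """Return width:height reduced to lowest terms."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    d = gcd(width, height)
    return width // d, height // d


def resolution_bucket(width: int, height: int) -> str:
    """Label an image by the first bucket edge its longer side fits under."""
    longer = max(width, height)
    for edge in BUCKET_EDGES:
        if longer <= edge:
            return f"<= {edge}px"
    return f"> {BUCKET_EDGES[-1]}px"


print(aspect_ratio(3840, 2160), resolution_bucket(3840, 2160))
```

Bucketing on the longer side keeps portrait and landscape images of the same size in the same group; other conventions, such as total pixel count, would work equally well for a sketch like this.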

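Dataset Creation

Source data: Yahoo Flickr Creative Commons 100M Dataset and synthetically generated caption data
Data processing: all synthetic captions were generated with BLIP2
Data producers: Flickr users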

License

CC-BY-SA-4.0

Language

English (en)

Dataset Authors

Aaron Gokaslan

Dataset Contact

Aaron Gokaslan
