
common-canvas/commoncatalog-cc-by-sa

Source: Hugging Face · Updated: 2024-05-16 · Indexed: 2024-05-25
Download link:
https://hf-mirror.com/datasets/common-canvas/commoncatalog-cc-by-sa
Resource description:
---
license: cc-by-sa-4.0
dataset_info:
  features:
  - name: jpg
    dtype: image
  - name: blip2_caption
    dtype: string
  - name: caption
    dtype: string
  - name: licensename
    dtype: string
  - name: licenseurl
    dtype: string
  - name: width
    dtype: int32
  - name: height
    dtype: int32
  - name: original_width
    dtype: int32
  - name: original_height
    dtype: int32
  - name: photoid
    dtype: int64
  - name: uid
    dtype: string
  - name: unickname
    dtype: string
  - name: datetaken
    dtype: timestamp[us]
  - name: dateuploaded
    dtype: int64
  - name: capturedevice
    dtype: string
  - name: title
    dtype: string
  - name: usertags
    dtype: string
  - name: machinetags
    dtype: string
  - name: longitude
    dtype: float64
  - name: latitude
    dtype: float64
  - name: accuracy
    dtype: int64
  - name: pageurl
    dtype: string
  - name: downloadurl
    dtype: string
  - name: serverid
    dtype: int64
  - name: farmid
    dtype: int64
  - name: secret
    dtype: string
  - name: secretoriginal
    dtype: string
  - name: ext
    dtype: string
  - name: url
    dtype: string
  - name: key
    dtype: string
  - name: status
    dtype: string
  - name: error_message
    dtype: string
  - name: exif
    dtype: string
  - name: sha256
    dtype: string
  - name: description
    dtype: string
task_categories:
- text-to-image
language:
- en
---

# Dataset Card for CommonCatalog CC-BY-SA

This dataset is a large collection of high-resolution Creative Commons images (composed of different licenses, see paper Table 1 in the Appendix) collected in 2014 from users of Yahoo Flickr. The dataset contains images of up to 4k resolution, making this one of the highest-resolution captioned image datasets.

## Dataset Details

### Dataset Description

We provide synthetic captions for approximately 100 million high-resolution images collected from Yahoo Flickr Creative Commons (YFCC).

- **Curated by:** Aaron Gokaslan
- **Language(s) (NLP):** en
- **License:** See relevant yaml tag / dataset name.

### Dataset Sources

<!-- Provide the basic links for the dataset. -->

- **Repository:** https://github.com/mosaicml/diffusion
- **Paper:** https://arxiv.org/abs/2310.16825
- **Demo:** See CommonCanvas Gradios

## Uses

We use CommonCatalog to train a family of latent diffusion models called CommonCanvas. The goal is to produce a model that is competitive with Stable Diffusion 2, but to do so using an easily accessible dataset of known provenance. Doing so makes replicating the model significantly easier and provides a clearer mechanism for applying training-data attribution techniques.

### Direct Use

- Training text-to-image models
- Training image-to-text models

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the dataset will not work well for. -->

* Crafting content that is offensive or injurious towards individuals, including negative portrayals of their living conditions, cultural backgrounds, religious beliefs, etc.
* Deliberately creating or spreading content that is discriminatory or reinforces harmful stereotypes.
* Falsely representing individuals without their permission.
* Generating sexual content that may be seen by individuals without their consent.
* Producing or disseminating false or misleading information.
* Creating content that depicts extreme violence or bloodshed.
* Distributing content that modifies copyrighted or licensed material in a way that breaches its usage terms.

## Dataset Structure

The dataset is divided into 10 subsets, each containing Parquet files of about 4 GB each. Each subfolder within contains a resolution range of the images and their respective aspect ratios. The dataset is also split into images licensed for commercial use (C) and those that are not (NC).

## Dataset Creation

### Curation Rationale

Creating a standardized, accessible dataset with synthetic captions and releasing it so that other people can train on a common dataset for open-source image generation.

### Source Data

Yahoo Flickr Creative Commons 100M Dataset and synthetically generated caption data.

#### Data Collection and Processing

All synthetic captions were generated with BLIP2. See paper for more details.

#### Who are the source data producers?

<!-- This section describes the people or systems who originally created the data. It should also include self-reported demographic or identity information for the source data creators if this information is available. -->

Users of Flickr

## Bias, Risks, and Limitations

See the Yahoo Flickr Creative Commons 100M dataset for more information. The data was collected circa 2014 and is known to be biased towards internet-connected Western countries. Some areas, such as the Global South, lack representation.

## Citation

**BibTeX:**

```
@article{gokaslan2023commoncanvas,
  title={CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images},
  author={Gokaslan, Aaron and Cooper, A Feder and Collins, Jasmine and Seguin, Landan and Jacobson, Austin and Patel, Mihir and Frankle, Jonathan and Stephenson, Cory and Kuleshov, Volodymyr},
  journal={arXiv preprint arXiv:2310.16825},
  year={2023}
}
```

## Dataset Card Authors

[Aaron Gokaslan](https://huggingface.co/Skylion007)

## Dataset Card Contact

[Aaron Gokaslan](https://huggingface.co/Skylion007)
Provided by:
common-canvas
Summary of original information

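Dataset overview

Dataset name

CommonCatalog CC-BY-SA

Dataset description

A large collection of high-resolution Creative Commons images, collected in 2014 from users of Yahoo Flickr. The dataset contains images of up to 4k resolution, making it one of the highest-resolution captioned image datasets.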

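Dataset characteristics

  • Image resolution: up to 4k
  • Number of images: approximately 100 million
  • Image source: Yahoo Flickr Creative Commons
  • Additional information: each image comes with a synthetic caption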

Dataset contents

  • Image format: jpg
  • Text fields: blip2_caption, caption, licensename, licenseurl, title, usertags, machinetags, description
  • Metadata: width, height, original_width, original_height, photoid, uid, unickname, datetaken, dateuploaded, capturedevice, longitude, latitude, accuracy, pageurl, downloadurl, serverid, farmid, secret, secretoriginal, ext, url, key, status, error_message, exif, sha256
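
As a rough illustration of how the fields listed above might be used, the following Python sketch filters records by image size using the `width` and `height` columns. The `Record` class, the `min_side_at_least` helper, and the sample rows are hypothetical and exist only for this example; they are not part of the dataset or of any library, and only a small subset of the columns is modeled.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Record:
    """Hypothetical container for a few of the columns named in the card."""
    key: str
    blip2_caption: str
    width: int
    height: int


def min_side_at_least(records: Iterable[Record], min_side: int) -> Iterator[Record]:
    """Yield records whose shorter side is at least `min_side` pixels."""
    for rec in records:
        if min(rec.width, rec.height) >= min_side:
            yield rec


# Made-up rows for illustration only; they are not taken from the dataset.
SAMPLE = [
    Record("a", "a dog on a beach", 4096, 2731),
    Record("b", "a city street at night", 640, 480),
    Record("c", "a bowl of fruit", 1024, 1024),
]

if __name__ == "__main__":
    for rec in min_side_at_least(SAMPLE, 1024):
        print(rec.key, rec.blip2_caption)
```

Filtering on the shorter side is only one possible design choice; a pipeline could equally filter on pixel count or on `original_width` and `original_height`.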

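Dataset uses

  • Direct use: training text-to-image models; training image-to-text models
  • Out-of-scope use: the dataset should not be used to create or spread harmful or discriminatory content, to falsely represent individuals without their permission, to generate sexual content that may be seen by individuals without their consent, to spread false or misleading information, to create content depicting extreme violence or bloodshed, or to modify copyrighted or licensed material in a way that breaches its usage terms.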

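Dataset structure

The dataset is divided into 10 subsets, each containing Parquet files of about 4 GB each. Each subfolder contains images in a given resolution range together with their respective aspect ratios. The dataset is also split by license into images available for commercial use (C) and those that are not (NC).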
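
The idea of grouping images by resolution range and aspect ratio can be sketched as follows. This is a minimal illustration, not the dataset's actual layout: the bucket edges in `RESOLUTION_EDGES`, the aspect-ratio thresholds, and all function names are assumptions made for this example, since the real folder ranges are not stated here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical bucket edges (shorter side, in pixels); the real ranges are not given in the card.
RESOLUTION_EDGES = (256, 512, 1024, 2048)


def resolution_bucket(width: int, height: int) -> str:
    """Name the resolution range that contains the image's shorter side."""
    short = min(width, height)
    lower = 0
    for edge in RESOLUTION_EDGES:
        if short < edge:
            return f"{lower}-{edge}"
        lower = edge
    return f"{lower}+"


def aspect_bucket(width: int, height: int) -> str:
    """Classify the aspect ratio using assumed thresholds of 0.9 and 1.1."""
    ratio = width / height
    if ratio > 1.1:
        return "landscape"
    if ratio < 0.9:
        return "portrait"
    return "square"


def group_by_bucket(
    sizes: Iterable[Tuple[int, int]],
) -> Dict[Tuple[str, str], List[Tuple[int, int]]]:
    """Group (width, height) pairs by (resolution bucket, aspect bucket)."""
    groups: Dict[Tuple[str, str], List[Tuple[int, int]]] = defaultdict(list)
    for width, height in sizes:
        bucket = (resolution_bucket(width, height), aspect_bucket(width, height))
        groups[bucket].append((width, height))
    return dict(groups)


if __name__ == "__main__":
    # Made-up image sizes for illustration only.
    for bucket, items in group_by_bucket([(640, 480), (4096, 2731), (1024, 1024)]).items():
        print(bucket, items)
```

Grouping images of similar size and shape in this way is a common preparation step for batching during training, which may be why the subfolders are organized along these lines.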

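Dataset creation

  • Source data: Yahoo Flickr Creative Commons 100M Dataset and synthetically generated caption data
  • Data processing: all synthetic captions were generated with BLIP2
  • Data producers: users of Flickr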

License

CC-BY-SA-4.0

Language

English (en)

Dataset authors

Aaron Gokaslan

Dataset contact

Aaron Gokaslan
