PD12M
ModelScope community. Updated 2025-12-18. Indexed 2024-11-02.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/PD12M
# PD12M

# Summary
At 12.4 million image-caption pairs, PD12M is the largest public domain image-text dataset to date, with sufficient size to train foundation models while minimizing copyright concerns. Through the Source.Plus platform, we also introduce novel, community-driven dataset governance mechanisms that reduce harm and support reproducibility over time.
[Jordan Meyer](https://linkedin.com/in/jordanmeyer) [Nicholas Padgett](https://www.linkedin.com/in/nicholas-padgett-36a921a0/) [Cullen Miller](https://www.linkedin.com/in/cullen-miller-312941290/) [Laura Exline](https://www.linkedin.com/in/lauraexline/)
[Paper](https://arxiv.org/abs/2410.23144) [Datasheet](https://huggingface.co/datasets/Spawning/PD12M/blob/main/Datasheet.pdf) [Project](https://source.plus/pd12m)
# About
PD12M was built and curated with [Source.Plus](https://source.plus) to resolve many of the data quality issues that arise in web-scraped training data: the presence of copyrighted material, low-quality images and captions, violent or NSFW content, PII, decaying dataset quality via broken links, etc.
PD12M consists entirely of public domain and CC0-licensed images, with automated recaptioning of image data and quality and safety filtering. Images in PD12M are also hosted on dedicated cloud storage, separate from the original image hosts, to avoid placing an undue burden on those hosts or impacting services for regular users. This also ensures the dataset remains wholly intact over its lifetime.
# Overview
This dataset has two components. The first is the `metadata`, which contains the image URLs, captions, image dimensions, etc. The second is the `images`.
## Metadata
The metadata is made available through a series of parquet files with the following schema:
- `id`: A unique identifier for the image.
- `url`: The URL of the image.
- `caption`: A caption for the image.
- `width`: The width of the image in pixels.
- `height`: The height of the image in pixels.
- `mime_type`: The MIME type of the image file.
- `hash`: The MD5 hash of the image file.
- `license`: The URL of the image license.
- `source`: The source organization of the image.
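The schema above can be mirrored as a small typed record, which is handy for sanity checks after loading rows from the parquet files. This is a minimal sketch, not part of the dataset tooling: the class name `MetadataRecord`, the helper `meets_min_resolution`, the Python types chosen for each field, and the sample row are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataRecord:
    """One metadata row. Field names follow the schema; the types are assumed."""
    id: str
    url: str
    caption: str
    width: int
    height: int
    mime_type: str
    hash: str
    license: str
    source: str


def meets_min_resolution(record: MetadataRecord, min_side: int) -> bool:
    """Return True if the shorter image side is at least `min_side` pixels."""
    return min(record.width, record.height) >= min_side


# Invented placeholder row for illustration only (not taken from the dataset).
example = MetadataRecord(
    id="example-0001",
    url="https://example.com/image.jpg",
    caption="A placeholder caption.",
    width=1024,
    height=768,
    mime_type="image/jpeg",
    hash="0" * 32,
    license="https://example.com/license",
    source="Example Museum",
)

print(meets_min_resolution(example, 512))   # shorter side is 768, so True
print(meets_min_resolution(example, 1024))  # shorter side is 768, so False
```

In practice the rows would come from a parquet reader such as pandas or pyarrow rather than being written by hand.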
Additionally, CLIP ViT-L/14 embeddings are provided in the `embeddings` directory.
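One common use of such embeddings is nearest-neighbor search by cosine similarity. The sketch below is illustrative only: it uses random vectors as stand-ins, assumes 768-dimensional vectors (the output size of CLIP ViT-L/14), and does not assume anything about how the files in the `embeddings` directory are stored. The function name `top_k_cosine` is hypothetical.

```python
import numpy as np


def top_k_cosine(query: np.ndarray, embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k rows of `embeddings` most similar to `query`."""
    # Normalize both sides so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q
    # Sort by descending score and keep the first k indices.
    return np.argsort(-scores)[:k]


# Random stand-in data; real vectors would be loaded from the dataset files.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 768)).astype(np.float32)

# A query that is a slightly perturbed copy of row 42.
query = embeddings[42] + 0.01 * rng.standard_normal(768).astype(np.float32)

print(top_k_cosine(query, embeddings, k=3))  # row 42 ranks first
```

For the full 12.4 million vectors, an approximate nearest-neighbor index would likely be more practical than this brute-force scan.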
## Images
The image files are all hosted in the AWS S3 bucket `pd12m`. The URLs of the image files are all maintained in the metadata files.
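Because each metadata row carries both a `url` and an MD5 `hash`, a downloaded file can be checked for integrity. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and it assumes the `hash` column stores a hexadecimal digest, which the schema does not state explicitly.

```python
import hashlib
import urllib.request


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


def matches_hash(data: bytes, expected_hash: str) -> bool:
    """Compare the MD5 of `data` to `expected_hash` (assumed to be hex)."""
    return md5_hex(data) == expected_hash.strip().lower()


def fetch_and_verify(url: str, expected_hash: str, timeout: float = 30.0) -> bytes:
    """Download `url` and raise ValueError if its MD5 does not match."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    if not matches_hash(data, expected_hash):
        raise ValueError(f"MD5 mismatch for {url}")
    return data


# Offline check of the comparison logic with a known digest.
print(matches_hash(b"hello", "5d41402abc4b2a76b9719d911017c592"))  # True
```

`fetch_and_verify` would be called with the `url` and `hash` values from one metadata row. For bulk downloads, the image tutorial linked in the Tutorials section is the better starting point.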
# Tutorials
[Working with the Metadata](./tutorials/metadata.md)
[Downloading Images](./tutorials/images.md)
[Working with the Embeddings](./tutorials/embeddings.md)
# License
The dataset is licensed under the [CDLA-Permissive-2.0](https://cdla.dev/permissive-2-0/).
# Reporting Issues
We've gone to great lengths to ensure the dataset is free from objectionable and infringing content. If you find any issues or have any concerns, please flag the item in [Source.Plus](https://source.plus/collection/pd12m-mxenifxs), where our review process will remove the infringing material and find a suitable replacement.
# Citation
```bibtex
@misc{meyer2024publicdomain12mhighly,
      title={Public Domain 12M: A Highly Aesthetic Image-Text Dataset with Novel Governance Mechanisms},
      author={Jordan Meyer and Nick Padgett and Cullen Miller and Laura Exline},
      year={2024},
      eprint={2410.23144},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2410.23144},
}
```
Provider:
maas
Created:
2024-11-01



