Hosted on ModelScope: https://modelscope.cn/datasets/Intelligent-Internet/pd12m (updated 2025-11-27, indexed 2025-05-24)
# `PD12M`
This is a curated PD12M dataset for use with the [II-Commons](https://github.com/Intelligent-Internet/II-Commons) project.
## Dataset Details
### Dataset Description
This dataset comprises a curated [Public Domain 12M](https://source.plus/pd12m) image collection, refined by filtering for active image links. EXIF data was extracted, and images underwent preprocessing and feature extraction using [SigLIP 2](https://huggingface.co/papers/2502.14786). All vector embeddings are normalized 16-bit half-precision vectors optimized for L2 indexing with [vectorchord](https://github.com/tensorchord/vectorchord).
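Because the embeddings are normalized to unit length, ranking by L2 distance gives the same order as ranking by cosine similarity: for unit vectors, `||a - b||^2 = 2 - 2 (a · b)`. The sketch below checks this with toy 3-dimensional vectors; the values are illustrative and are not real SigLIP 2 embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy data (assumption): three candidates and one query, all normalized.
query = normalize([1.0, 2.0, 3.0])
candidates = [normalize(v) for v in ([3.0, 2.0, 1.0], [1.0, 2.0, 2.5], [-1.0, 0.0, 1.0])]

# Nearest first by L2 distance, and most similar first by cosine.
rank_by_l2 = sorted(range(len(candidates)), key=lambda i: l2_squared(query, candidates[i]))
rank_by_cosine = sorted(range(len(candidates)), key=lambda i: -dot(query, candidates[i]))
print(rank_by_l2, rank_by_cosine)
```

Both rankings come out identical, which is why an L2 index can serve similarity search over normalized embeddings.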
### Dataset Sources
This dataset is derived from [Spawning/PD12M](http://huggingface.co/datasets/Spawning/PD12M) and reorganized. The original license information for each image can be found in the corresponding entry of the original dataset.
## Dataset Structure
- id: A unique identifier for the image.
- url: The URL of the image.
- caption: A caption for the image.
- caption_long: A long caption for the image.
- origin_width: The width of the original image in pixels.
- origin_height: The height of the original image in pixels.
- processed_width: The width of the processed image in pixels.
- processed_height: The height of the processed image in pixels.
- aspect_ratio: The aspect ratio of the image.
- exif: The EXIF data of the image.
- meta: The metadata of the image.
- created_at: The creation time of the image.
- updated_at: The update time of the image.
- source: The source organization of the image.
- vector: The vector embedding of the image.
- origin_source: The origin source of the image.
- license: The license of the image.
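When CSV files are loaded with `COPY ... CSV HEADER`, PostgreSQL skips the header line and maps columns by position, so it can help to confirm that a file's header follows the field order listed above. This is a minimal sketch, assuming the CSV headers use exactly these field names (the source does not show a header line). `header_matches` is a hypothetical helper, and the in-memory sample stands in for a real file.

```python
import csv
import io

# Field order taken from the list above; COPY maps CSV columns by position.
EXPECTED_COLUMNS = (
    "id", "url", "caption", "caption_long",
    "origin_width", "origin_height", "processed_width", "processed_height",
    "aspect_ratio", "exif", "meta", "created_at", "updated_at",
    "source", "vector", "origin_source", "license",
)

def header_matches(csv_file):
    """Return True if the first CSV row equals EXPECTED_COLUMNS, in order."""
    header = next(csv.reader(csv_file), None)
    return header is not None and tuple(header) == EXPECTED_COLUMNS

# Stand-in for an open shard; a real call would pass open(path, newline="").
sample = io.StringIO(",".join(EXPECTED_COLUMNS) + "\n")
print(header_matches(sample))
```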
## Prerequisites
PostgreSQL 17 with the [vectorchord](https://github.com/tensorchord/VectorChord) and [pg_search](https://github.com/paradedb/paradedb/tree/dev/pg_search) extensions.
The easiest way to get started is to use our [Docker image](https://github.com/Intelligent-Internet/II-Commons/tree/main/examples/db), or build your own. Then load the [psql_basebackup](https://huggingface.co/datasets/Intelligent-Internet/pd12m/tree/psql_basebackup) branch by following the [Quick Start](https://github.com/Intelligent-Internet/II-Commons?tab=readme-ov-file#quick-start).
To make sure the extensions are enabled, connect to the database with psql and run the following SQL:
```sql
CREATE EXTENSION IF NOT EXISTS vchord CASCADE;
CREATE EXTENSION IF NOT EXISTS pg_search CASCADE;
```
## Uses
This dataset is available for a wide range of applications.
Here is a demo of how to use the dataset with [II-Commons](https://github.com/Intelligent-Internet/II-Commons).
### Create a Table in PostgreSQL
```sql
CREATE TABLE IF NOT EXISTS is_pd12m (
id BIGSERIAL PRIMARY KEY,
url VARCHAR NOT NULL,
caption VARCHAR NOT NULL DEFAULT '',
caption_long VARCHAR DEFAULT '',
origin_width BIGINT NOT NULL DEFAULT 0,
origin_height BIGINT NOT NULL DEFAULT 0,
processed_width BIGINT NOT NULL DEFAULT 0,
processed_height BIGINT NOT NULL DEFAULT 0,
aspect_ratio DOUBLE PRECISION NOT NULL DEFAULT 0,
exif JSONB NOT NULL DEFAULT '{}',
meta JSONB NOT NULL DEFAULT '{}',
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
source JSONB NOT NULL DEFAULT '[]',
vector halfvec(1152) DEFAULT NULL,
origin_source VARCHAR DEFAULT '',
license VARCHAR DEFAULT ''
);
```
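The `vector halfvec(1152)` column stores each element as a 16-bit float, which halves the element payload compared with 32-bit floats at the cost of some precision. The sketch below uses Python's `struct` module (format `e` is IEEE 754 half precision) to show both effects. It counts element bytes only and ignores any per-value overhead PostgreSQL adds.

```python
import struct

DIMENSIONS = 1152  # from the halfvec(1152) column definition

def to_half_and_back(x):
    """Round-trip a float through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

half_bytes = DIMENSIONS * struct.calcsize("<e")    # 2 bytes per element
single_bytes = DIMENSIONS * struct.calcsize("<f")  # 4 bytes per element

print(half_bytes, single_bytes)   # 2304 4608
print(to_half_and_back(0.1))      # close to 0.1, but not exactly 0.1
```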
### Load CSV Files into the Database
1. Load the dataset from the local file system to a remote PostgreSQL server (psql's `\copy` reads the files on the client side):
```sql
\copy is_pd12m FROM 'data/0000000.csv' CSV HEADER;
\copy is_pd12m FROM 'data/0000001.csv' CSV HEADER;
\copy is_pd12m FROM 'data/0000002.csv' CSV HEADER;
...
```
2. Load the dataset from the PostgreSQL server's file system (`COPY` reads the files on the server, so the paths must be accessible to the server process):
```sql
copy is_pd12m FROM 'data/0000000.csv' CSV HEADER;
copy is_pd12m FROM 'data/0000001.csv' CSV HEADER;
copy is_pd12m FROM 'data/0000002.csv' CSV HEADER;
...
```
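Typing one statement per file gets tedious, so the statements can be generated instead. This sketch assumes the files keep the zero-padded, seven-digit naming shown above. `copy_commands` and `shard_count` are hypothetical, because the source does not state how many files there are.

```python
def copy_commands(shard_count, table="is_pd12m", client_side=True):
    """Build one load statement per CSV shard.

    shard_count is a hypothetical parameter: the source elides the real
    number of files. client_side=True emits psql's \\copy meta-command,
    client_side=False emits server-side COPY.
    """
    verb = "\\copy" if client_side else "COPY"
    return [
        f"{verb} {table} FROM 'data/{index:07d}.csv' CSV HEADER;"
        for index in range(shard_count)
    ]

for line in copy_commands(3):
    print(line)
```

The printed lines can be saved to a file and run with psql's `-f` option.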
### Create Indexes
Create the following indexes for best performance.
The `vector` column has type `halfvec(1152)`: a 1152-dimensional vector of 16-bit half-precision floats, indexed here for `L2` distance with [vectorchord](https://github.com/tensorchord/vectorchord). See the [vectorchord](https://docs.vectorchord.ai/vectorchord/usage/indexing.html) indexing documentation for more information about the vector index.
```sql
CREATE UNIQUE INDEX IF NOT EXISTS is_pd12m_url_index ON is_pd12m (url);
CREATE INDEX IF NOT EXISTS is_pd12m_origin_width_index ON is_pd12m (origin_width);
CREATE INDEX IF NOT EXISTS is_pd12m_origin_height_index ON is_pd12m (origin_height);
CREATE INDEX IF NOT EXISTS is_pd12m_processed_width_index ON is_pd12m (processed_width);
CREATE INDEX IF NOT EXISTS is_pd12m_processed_height_index ON is_pd12m (processed_height);
CREATE INDEX IF NOT EXISTS is_pd12m_aspect_ratio_index ON is_pd12m (aspect_ratio);
CREATE INDEX IF NOT EXISTS is_pd12m_exif_index ON is_pd12m USING gin(exif);
CREATE INDEX IF NOT EXISTS is_pd12m_meta_index ON is_pd12m USING gin(meta);
CREATE INDEX IF NOT EXISTS is_pd12m_source_index ON is_pd12m USING gin(source);
CREATE INDEX IF NOT EXISTS is_pd12m_created_at_index ON is_pd12m (created_at);
CREATE INDEX IF NOT EXISTS is_pd12m_updated_at_index ON is_pd12m (updated_at);
CREATE INDEX IF NOT EXISTS is_pd12m_vector_index ON is_pd12m USING vchordrq (vector halfvec_l2_ops) WITH (options = $$
residual_quantization = true
[build.internal]
lists = [20000]
build_threads = 6
spherical_centroids = false
$$);
CREATE INDEX IF NOT EXISTS is_pd12m_caption_index ON is_pd12m (caption) WHERE caption = '';
CREATE INDEX IF NOT EXISTS is_pd12m_caption_long_index ON is_pd12m (caption_long) WHERE caption_long = '';
CREATE INDEX IF NOT EXISTS is_pd12m_vector_null_index ON is_pd12m (vector) WHERE vector IS NULL;
```
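With `is_pd12m_vector_index` in place, a nearest-neighbor search orders rows by the `<->` (L2 distance) operator. The sketch below only builds the query text and parameters. `vector_literal` and `nearest_neighbors_query` are hypothetical helpers, and the `%s` placeholders assume a driver such as psycopg. Running the query needs a live database, which is left out here, and the all-zero embedding is a placeholder rather than a real query vector.

```python
def vector_literal(values):
    """Format floats in the '[v1,v2,...]' text form accepted by vector types."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

def nearest_neighbors_query(embedding, limit=10, dimensions=1152):
    """Return (sql, params) for an L2 nearest-neighbor search on is_pd12m."""
    if len(embedding) != dimensions:
        raise ValueError(f"expected {dimensions} values, got {len(embedding)}")
    sql = (
        "SELECT id, url, caption "
        "FROM is_pd12m "
        f"ORDER BY vector <-> %s::halfvec({dimensions}) "
        "LIMIT %s"
    )
    return sql, (vector_literal(embedding), limit)

# Placeholder embedding (assumption); a real one would come from the same
# SigLIP 2 model, normalized like the stored vectors.
sql, params = nearest_neighbors_query([0.0] * 1152, limit=5)
print(sql)
```

Passing the embedding as a parameter instead of interpolating it into the SQL string keeps the statement text constant across searches.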
### Query with II-Commons
See the [II-Commons](https://github.com/Intelligent-Internet/II-Commons) repository to learn how to query the dataset.

Provider: maas
Created: 2025-05-20