pd12m

ModelScope Community, updated 2025-11-27, indexed 2025-05-24
Download link:
https://modelscope.cn/datasets/Intelligent-Internet/pd12m
Resource overview:
# `PD12M`

This is a curated PD12M dataset for use with the [II-Commons](https://github.com/Intelligent-Internet/II-Commons) project.

## Dataset Details

### Dataset Description

This dataset comprises a curated [Public Domain 12M](https://source.plus/pd12m) image collection, refined by keeping only entries whose image links are still active. EXIF data was extracted, and the images were preprocessed and passed through [SigLIP 2](https://huggingface.co/papers/2502.14786) for feature extraction. All vector embeddings are normalized 16-bit half-precision vectors optimized for L2 indexing with [vectorchord](https://github.com/tensorchord/vectorchord).

### Dataset Sources

This dataset is derived and organized from [Spawning/PD12M](http://huggingface.co/datasets/Spawning/PD12M). The original license information for each image can be found in the corresponding entry of the original database.

## Dataset Structure

- id: A unique identifier for the image.
- url: The URL of the image.
- caption: A caption for the image.
- caption_long: A long caption for the image.
- origin_width: The width of the original image in pixels.
- origin_height: The height of the original image in pixels.
- processed_width: The width of the processed image in pixels.
- processed_height: The height of the processed image in pixels.
- aspect_ratio: The aspect ratio of the image.
- exif: The EXIF data of the image.
- meta: The metadata of the image.
- created_at: The creation time of the image.
- updated_at: The update time of the image.
- source: The source organization of the image.
- vector: The vector embedding of the image.
- origin_source: The origin source of the image.
- license: The license of the image.

## Prerequisite

PostgreSQL 17 with the extensions [vectorchord](https://github.com/tensorchord/VectorChord) and [pg_search](https://github.com/paradedb/paradedb/tree/dev/pg_search).

The easiest way is to use our [Docker image](https://github.com/Intelligent-Internet/II-Commons/tree/main/examples/db), or build your own. Then load the [psql_basebackup](https://huggingface.co/datasets/Intelligent-Internet/pd12m/tree/psql_basebackup) branch, following the [Quick Start](https://github.com/Intelligent-Internet/II-Commons?tab=readme-ov-file#quick-start).

To ensure the extensions are enabled, connect to the database using psql and run the following SQL:

```sql
CREATE EXTENSION IF NOT EXISTS vchord CASCADE;
CREATE EXTENSION IF NOT EXISTS pg_search CASCADE;
```

## Uses

This dataset is available for a wide range of applications.

Here is a demo of how to use the dataset with [II-Commons](https://github.com/Intelligent-Internet/II-Commons).

### Create a Table in PostgreSQL

```sql
CREATE TABLE IF NOT EXISTS is_pd12m (
    id BIGSERIAL PRIMARY KEY,
    url VARCHAR NOT NULL,
    caption VARCHAR NOT NULL DEFAULT '',
    caption_long VARCHAR DEFAULT '',
    origin_width BIGINT NOT NULL DEFAULT 0,
    origin_height BIGINT NOT NULL DEFAULT 0,
    processed_width BIGINT NOT NULL DEFAULT 0,
    processed_height BIGINT NOT NULL DEFAULT 0,
    aspect_ratio DOUBLE PRECISION NOT NULL DEFAULT 0,
    exif JSONB NOT NULL DEFAULT '{}',
    meta JSONB NOT NULL DEFAULT '{}',
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    source JSONB NOT NULL DEFAULT '[]',
    vector halfvec(1152) DEFAULT NULL,
    origin_source VARCHAR DEFAULT '',
    license VARCHAR DEFAULT ''
);
```

### Load CSV Files into the Database

1. Load the dataset from the local file system into a remote PostgreSQL server (psql's client-side `\copy`):

```sql
\copy is_pd12m FROM 'data/0000000.csv' CSV HEADER;
\copy is_pd12m FROM 'data/0000001.csv' CSV HEADER;
\copy is_pd12m FROM 'data/0000002.csv' CSV HEADER;
...
```

2. Load the dataset from the PostgreSQL server's own file system (server-side `COPY`):

```sql
COPY is_pd12m FROM 'data/0000000.csv' CSV HEADER;
COPY is_pd12m FROM 'data/0000001.csv' CSV HEADER;
COPY is_pd12m FROM 'data/0000002.csv' CSV HEADER;
...
```

### Create Indexes

Create the following indexes for the best performance.

The `vector` column has type `halfvec(1152)`, a 16-bit half-precision vector optimized for `L2` indexing with [vectorchord](https://github.com/tensorchord/vectorchord). More information about the vector index is available in the [vectorchord](https://docs.vectorchord.ai/vectorchord/usage/indexing.html) documentation.

```sql
-- Unique and B-tree indexes on scalar columns
CREATE UNIQUE INDEX IF NOT EXISTS is_pd12m_url_index ON is_pd12m (url);
CREATE INDEX IF NOT EXISTS is_pd12m_origin_width_index ON is_pd12m (origin_width);
CREATE INDEX IF NOT EXISTS is_pd12m_origin_height_index ON is_pd12m (origin_height);
CREATE INDEX IF NOT EXISTS is_pd12m_processed_width_index ON is_pd12m (processed_width);
CREATE INDEX IF NOT EXISTS is_pd12m_processed_height_index ON is_pd12m (processed_height);
CREATE INDEX IF NOT EXISTS is_pd12m_aspect_ratio_index ON is_pd12m (aspect_ratio);

-- GIN indexes on JSONB columns
CREATE INDEX IF NOT EXISTS is_pd12m_exif_index ON is_pd12m USING gin(exif);
CREATE INDEX IF NOT EXISTS is_pd12m_meta_index ON is_pd12m USING gin(meta);
CREATE INDEX IF NOT EXISTS is_pd12m_source_index ON is_pd12m USING gin(source);

CREATE INDEX IF NOT EXISTS is_pd12m_created_at_index ON is_pd12m (created_at);
CREATE INDEX IF NOT EXISTS is_pd12m_updated_at_index ON is_pd12m (updated_at);

-- VectorChord index for L2 search on the embeddings
CREATE INDEX IF NOT EXISTS is_pd12m_vector_index ON is_pd12m USING vchordrq (vector halfvec_l2_ops) WITH (options = $$
residual_quantization = true
[build.internal]
lists = [20000]
build_threads = 6
spherical_centroids = false
$$);

-- Partial indexes for finding rows with missing captions or embeddings
CREATE INDEX IF NOT EXISTS is_pd12m_caption_index ON is_pd12m (caption) WHERE caption = '';
CREATE INDEX IF NOT EXISTS is_pd12m_caption_long_index ON is_pd12m (caption_long) WHERE caption_long = '';
CREATE INDEX IF NOT EXISTS is_pd12m_vector_null_index ON is_pd12m (vector) WHERE vector IS NULL;
```

### Query with II-Commons

See [II-Commons](https://github.com/Intelligent-Internet/II-Commons) to learn how to query the dataset.
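For readers who want a feel for what a similarity lookup against the `vector` column might involve outside II-Commons, here is a minimal Python sketch. It is not taken from the II-Commons code base: the helper names (`l2_normalize`, `to_halfvec_literal`, `build_knn_query`) are hypothetical, the `<->` operator is assumed to be the L2 distance operator that the `halfvec_l2_ops` index serves, and producing the 1152-dimensional SigLIP 2 query embedding is left out entirely.

```python
import math

DIM = 1152  # dimension of the halfvec(1152) column in is_pd12m


def l2_normalize(vec):
    """Scale vec to unit L2 norm, to match the normalized stored embeddings."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def to_halfvec_literal(vec):
    """Format a vector in the '[x1,x2,...]' text form accepted by halfvec."""
    if len(vec) != DIM:
        raise ValueError(f"expected {DIM} dimensions, got {len(vec)}")
    return "[" + ",".join(f"{x:.6g}" for x in vec) + "]"


def build_knn_query(limit=10):
    """Return SQL for the `limit` nearest rows by L2 distance.

    %(q)s is a named driver placeholder (psycopg style, an assumption);
    the query vector is supplied separately as a halfvec text literal.
    """
    return (
        "SELECT id, url, caption "
        "FROM is_pd12m "
        f"ORDER BY vector <-> %(q)s::halfvec({DIM}) "
        f"LIMIT {int(limit)}"
    )
```

With a driver such as psycopg, the call would look roughly like `cur.execute(build_knn_query(5), {"q": to_halfvec_literal(l2_normalize(embedding))})`, where `embedding` is a query embedding obtained elsewhere. Keeping the vector as a bound parameter rather than pasting it into the SQL string avoids building a very long statement by hand.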

Provider:
maas
Created:
2025-05-20