WenxingZhu/multimodal-embedding-10M

Name: WenxingZhu/multimodal-embedding-10M
Creator: WenxingZhu
Published: 2026-04-08 05:20:33
License: Apache 2.0

Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/WenxingZhu/multimodal-embedding-10M


Description:

---
task_categories:
- feature-extraction
language:
- en
license: apache-2.0
size_categories:
- 10M<n<100M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: test
    path: data/test-*.parquet
  - split: neighbors
    path: data/neighbors.parquet
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: emb
    sequence: float32
  splits:
  - name: train
    num_examples: 10000000
  - name: test
    num_examples: 10000
  - name: neighbors
    num_examples: 10000
tags:
- multimodal
- embeddings
- vector-search
- benchmark
- image-text
---

# Multimodal Embedding 10M Benchmark Dataset

A large-scale vector search benchmark dataset containing **10M base vectors** and **10K query vectors** with pre-computed ground truth (top-100 nearest neighbors by Inner Product), generated from multimodal (image + text) inputs.

## Dataset Summary

| Property | Value |
|----------|-------|
| Base Vectors | 10,000,000 |
| Query Vectors | 10,000 |
| Dimension | 4,096 |
| Distance Metric | Inner Product (IP) |
| Top-K Ground Truth | 100 |
| Vector dtype | float32 |
| Embedding Model | [Qwen3-VL-Embedding-8B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B) |
| Source Data | [pixparse/cc12m-wds](https://huggingface.co/datasets/pixparse/cc12m-wds) (Conceptual Captions 12M) |
| Base/Query Overlap | Yes (0.1%, query shards 40-41 included in base) |

## Splits

| Split | Rows | Description |
|-------|------|-------------|
| `train` | 10,000,000 | Base vectors (20 parquet shards, 500K rows each) |
| `test` | 10,000 | Query vectors |
| `neighbors` | 10,000 | Ground truth: top-100 nearest neighbor IDs per query |

## Schema

**train / test:**

| Column | Type | Description |
|--------|------|-------------|
| `id` | int64 | Sequential identifier (0-indexed) |
| `emb` | list\<float32\> | 4096-dim L2-normalized embedding |

**neighbors:**

| Column | Type | Description |
|--------|------|-------------|
| `id` | int64 | Query ID (matches test split) |
| `neighbors` | list\<int64\> | Top-100 base vector IDs by IP score (descending) |

## Usage

```python
from datasets import load_dataset
import numpy as np

ds = load_dataset("WenxingZhu/multimodal-embedding-10M")

# Base and query embeddings
base_emb = np.array(ds["train"]["emb"])   # (10_000_000, 4096) - streams from 20 shards
query_emb = np.array(ds["test"]["emb"])   # (10_000, 4096)

# Ground truth neighbors
neighbors = ds["neighbors"]["neighbors"]  # list of 10K lists, each 100 int64 IDs
```

### Streaming (recommended for 10M)

```python
from datasets import load_dataset

ds = load_dataset("WenxingZhu/multimodal-embedding-10M", split="train", streaming=True)
for row in ds:
    emb = row["emb"]  # list of 4096 floats
    break
```

## Generation Details

- **Model**: Qwen3-VL-Embedding-8B via vLLM (pooling runner, bfloat16)
- **Input**: Image-text pairs from cc12m-wds WebDataset shards, streamed from HuggingFace
- **Normalization**: L2 normalized post-embedding
- **Hardware**: 6x NVIDIA A100-80GB (DGX)
- **Throughput**: ~25 samples/sec per GPU
- **Total Generation Time**: ~3 days for 10M embeddings
- **Ground Truth**: Brute-force inner product, chunked numpy computation

## Related

- [multimodal-embedding-1M](https://huggingface.co/datasets/WenxingZhu/multimodal-embedding-1M): 1M version from cc3m-wds

## License

Apache 2.0
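The Generation Details above state that the ground truth came from a brute-force inner product computed in chunks with NumPy, but the card does not include that script. The following is a minimal sketch of how such a computation and a recall@k check against the `neighbors` split might look. The function names `brute_force_topk_ip` and `recall_at_k`, the `chunk_size` parameter, and the synthetic data are all illustrative assumptions, not the author's code.

```python
import numpy as np


def brute_force_topk_ip(base, queries, k, chunk_size=1024):
    """Exact top-k base row indices per query by inner product (descending).

    Hypothetical helper: scans `base` in chunks so the full score matrix
    is never held in memory. Assumes k <= base.shape[0].
    """
    n_queries = queries.shape[0]
    best_scores = np.full((n_queries, k), -np.inf, dtype=np.float64)
    best_ids = np.full((n_queries, k), -1, dtype=np.int64)
    for start in range(0, base.shape[0], chunk_size):
        chunk = base[start:start + chunk_size]
        scores = (queries @ chunk.T).astype(np.float64)
        ids = np.arange(start, start + chunk.shape[0], dtype=np.int64)
        ids = np.broadcast_to(ids, scores.shape)
        # Merge the running best with this chunk and keep the k largest.
        cand_scores = np.concatenate([best_scores, scores], axis=1)
        cand_ids = np.concatenate([best_ids, ids], axis=1)
        order = np.argsort(-cand_scores, axis=1, kind="stable")[:, :k]
        best_scores = np.take_along_axis(cand_scores, order, axis=1)
        best_ids = np.take_along_axis(cand_ids, order, axis=1)
    return best_ids


def recall_at_k(predicted, ground_truth, k):
    """Fraction of the true top-k IDs that appear in the predicted top-k."""
    hits = 0
    for pred, truth in zip(predicted, ground_truth):
        hits += len(set(map(int, pred[:k])) & set(map(int, truth[:k])))
    return hits / (len(ground_truth) * k)


# Synthetic stand-in for the real splits (small, so it runs anywhere).
rng = np.random.default_rng(0)
base = rng.standard_normal((2000, 32)).astype(np.float32)
base /= np.linalg.norm(base, axis=1, keepdims=True)  # L2-normalize, as in the card
queries = base[:20].copy()

truth = brute_force_topk_ip(base, queries, k=10, chunk_size=256)
# Simulate an imperfect index by searching only the first half of the base.
approx = brute_force_topk_ip(base[:1000], queries, k=10, chunk_size=256)
print("recall@10 of the simulated index:", recall_at_k(approx, truth, k=10))
```

With the real data, `truth` would be replaced by the lists in the `neighbors` split and `approx` by the IDs returned from the index under test. Because the embeddings are L2-normalized, ranking by inner product gives the same order as ranking by cosine similarity.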

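A dense float32 matrix of 10,000,000 × 4,096 values occupies about 164 GB, which is why the card recommends streaming for the `train` split. As a rough sketch, streamed rows can be copied in batches into a preallocated array. The helper names `footprint_gb` and `fill_from_rows` are hypothetical, and the code assumes rows arrive in `id` order so that row position equals `id`.

```python
import numpy as np

DIM = 4096  # embedding dimension stated in the dataset card


def footprint_gb(n_vectors, dim=DIM, bytes_per_value=4):
    """Size of a dense float32 matrix in decimal gigabytes."""
    return n_vectors * dim * bytes_per_value / 1e9


def fill_from_rows(rows, out, batch_size=1024):
    """Copy row["emb"] from an iterable of dict rows into `out`, batch by batch.

    Hypothetical helper. Stops when `out` is full or `rows` is exhausted
    and returns the number of rows written. Assumes rows come in id order.
    """
    written = 0
    batch = []
    for row in rows:
        if written + len(batch) >= out.shape[0]:
            break  # destination is full
        batch.append(row["emb"])
        if len(batch) == batch_size:
            out[written:written + len(batch)] = np.asarray(batch, dtype=out.dtype)
            written += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        out[written:written + len(batch)] = np.asarray(batch, dtype=out.dtype)
        written += len(batch)
    return written


print(f"full train split as float32: ~{footprint_gb(10_000_000):.2f} GB")

# Synthetic rows shaped like the streamed records (dicts with "id" and "emb").
fake_rows = ({"id": i, "emb": [float(i)] * 8} for i in range(100))
subset = np.empty((50, 8), dtype=np.float32)
n_written = fill_from_rows(fake_rows, subset, batch_size=16)
print("rows written:", n_written)
```

In practice, `rows` could be the streaming dataset from the Usage section, and `out` could be an on-disk array created with `np.lib.format.open_memmap(path, mode="w+", dtype=np.float32, shape=(n, DIM))`, so that only one batch is held in RAM at a time.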
