
WenxingZhu/multimodal-embedding-10M

Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/WenxingZhu/multimodal-embedding-10M
---
task_categories:
- feature-extraction
language:
- en
license: apache-2.0
size_categories:
- 10M<n<100M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: test
    path: data/test-*.parquet
  - split: neighbors
    path: data/neighbors.parquet
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: emb
    sequence: float32
  splits:
  - name: train
    num_examples: 10000000
  - name: test
    num_examples: 10000
  - name: neighbors
    num_examples: 10000
tags:
- multimodal
- embeddings
- vector-search
- benchmark
- image-text
---

# Multimodal Embedding 10M Benchmark Dataset

A large-scale vector search benchmark dataset containing **10M base vectors** and **10K query vectors** with pre-computed ground truth (top-100 nearest neighbors by Inner Product), generated from multimodal (image + text) inputs.

## Dataset Summary

| Property | Value |
|----------|-------|
| Base Vectors | 10,000,000 |
| Query Vectors | 10,000 |
| Dimension | 4,096 |
| Distance Metric | Inner Product (IP) |
| Top-K Ground Truth | 100 |
| Vector dtype | float32 |
| Embedding Model | [Qwen3-VL-Embedding-8B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B) |
| Source Data | [pixparse/cc12m-wds](https://huggingface.co/datasets/pixparse/cc12m-wds) (Conceptual Captions 12M) |
| Base/Query Overlap | Yes (0.1%, query shards 40-41 included in base) |

## Splits

| Split | Rows | Description |
|-------|------|-------------|
| `train` | 10,000,000 | Base vectors (20 parquet shards, 500K rows each) |
| `test` | 10,000 | Query vectors |
| `neighbors` | 10,000 | Ground truth: top-100 nearest neighbor IDs per query |

## Schema

**train / test:**

| Column | Type | Description |
|--------|------|-------------|
| `id` | int64 | Sequential identifier (0-indexed) |
| `emb` | list\<float32\> | 4096-dim L2-normalized embedding |

**neighbors:**

| Column | Type | Description |
|--------|------|-------------|
| `id` | int64 | Query ID (matches test split) |
| `neighbors` | list\<int64\> | Top-100 base vector IDs by IP score (descending) |

## Usage

```python
from datasets import load_dataset
import numpy as np

ds = load_dataset("WenxingZhu/multimodal-embedding-10M")

# Base and query embeddings.
# The full base matrix is 10M x 4096 float32 values (about 164 GB), so
# materialize it only if it fits in RAM; otherwise use streaming (see below).
base_emb = np.array(ds["train"]["emb"], dtype=np.float32)   # (10_000_000, 4096), read from 20 shards
query_emb = np.array(ds["test"]["emb"], dtype=np.float32)   # (10_000, 4096)

# Ground truth neighbors
neighbors = ds["neighbors"]["neighbors"]  # list of 10K lists, each 100 int64 IDs
```

### Streaming (recommended for 10M)

```python
from datasets import load_dataset

ds = load_dataset("WenxingZhu/multimodal-embedding-10M", split="train", streaming=True)
for row in ds:
    emb = row["emb"]  # list of 4096 floats
    break
```

## Generation Details

- **Model**: Qwen3-VL-Embedding-8B via vLLM (pooling runner, bfloat16)
- **Input**: Image-text pairs from cc12m-wds WebDataset shards, streamed from HuggingFace
- **Normalization**: L2 normalized post-embedding
- **Hardware**: 6x NVIDIA A100-80GB (DGX)
- **Throughput**: ~25 samples/sec per GPU
- **Total Generation Time**: ~3 days for 10M embeddings
- **Ground Truth**: Brute-force inner product, chunked numpy computation

## Related

- [multimodal-embedding-1M](https://huggingface.co/datasets/WenxingZhu/multimodal-embedding-1M): 1M version from cc3m-wds

## License

Apache 2.0
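
## Example: Chunked Ground Truth and Recall

The Generation Details above describe the ground truth as a brute-force inner product computed in chunks with NumPy. The sketch below shows one way such a computation could look, together with a recall@k measure of the kind usually reported against a `neighbors` column. It is not the author's script: the function names `brute_force_topk_ip` and `recall_at_k`, the chunk size, and the small synthetic vectors are all assumptions chosen so the example runs without downloading the dataset.

```python
import numpy as np


def brute_force_topk_ip(base, queries, k=100, chunk_size=100_000):
    """Exact top-k by inner product, scanning the base vectors chunk by chunk.

    Hypothetical helper: returns (ids, scores), each of shape (n_queries, k),
    sorted by descending score.
    """
    n_queries = queries.shape[0]
    best_scores = np.full((n_queries, k), -np.inf, dtype=np.float32)
    best_ids = np.full((n_queries, k), -1, dtype=np.int64)

    for start in range(0, base.shape[0], chunk_size):
        chunk = base[start:start + chunk_size]
        scores = queries @ chunk.T  # (n_queries, chunk_len)
        ids = np.arange(start, start + chunk.shape[0], dtype=np.int64)
        ids = np.broadcast_to(ids, scores.shape)

        # Merge the running best with this chunk and keep the k largest scores.
        all_scores = np.concatenate([best_scores, scores], axis=1)
        all_ids = np.concatenate([best_ids, ids], axis=1)
        keep = np.argpartition(-all_scores, k - 1, axis=1)[:, :k]
        best_scores = np.take_along_axis(all_scores, keep, axis=1)
        best_ids = np.take_along_axis(all_ids, keep, axis=1)

    # argpartition does not order the k survivors, so sort them at the end.
    order = np.argsort(-best_scores, axis=1)
    return (np.take_along_axis(best_ids, order, axis=1),
            np.take_along_axis(best_scores, order, axis=1))


def recall_at_k(predicted, ground_truth, k=100):
    """Fraction of true top-k IDs that appear in the predicted top-k."""
    hits = 0
    for pred_row, true_row in zip(predicted, ground_truth):
        hits += len(set(list(pred_row)[:k]) & set(list(true_row)[:k]))
    return hits / (len(ground_truth) * k)


# Small synthetic stand-in for the real splits (assumed sizes, not the dataset).
rng = np.random.default_rng(0)
base = rng.standard_normal((2_000, 64)).astype(np.float32)
base /= np.linalg.norm(base, axis=1, keepdims=True)      # L2-normalize
queries = rng.standard_normal((5, 64)).astype(np.float32)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

gt_ids, gt_scores = brute_force_topk_ip(base, queries, k=10, chunk_size=300)
print(gt_ids.shape, recall_at_k(gt_ids, gt_ids, k=10))
```

With the real data, `predicted` would be the IDs returned by the index under test for the `test` vectors, and `ground_truth` would be the `neighbors` column. Because the embeddings are L2-normalized, ranking by inner product gives the same order as ranking by cosine similarity.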

Provider:
WenxingZhu