WenxingZhu/multimodal-embedding-10M
Hugging Face | Updated 2026-04-08 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/WenxingZhu/multimodal-embedding-10M
Resource description:
---
task_categories:
- feature-extraction
language:
- en
license: apache-2.0
size_categories:
- 10M<n<100M
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: test
    path: data/test-*.parquet
  - split: neighbors
    path: data/neighbors.parquet
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: emb
    sequence: float32
  splits:
  - name: train
    num_examples: 10000000
  - name: test
    num_examples: 10000
  - name: neighbors
    num_examples: 10000
tags:
- multimodal
- embeddings
- vector-search
- benchmark
- image-text
---
# Multimodal Embedding 10M Benchmark Dataset
A large-scale vector search benchmark dataset containing **10M base vectors** and **10K query vectors** with pre-computed ground truth (top-100 nearest neighbors by Inner Product), generated from multimodal (image + text) inputs.
## Dataset Summary
| Property | Value |
|----------|-------|
| Base Vectors | 10,000,000 |
| Query Vectors | 10,000 |
| Dimension | 4,096 |
| Distance Metric | Inner Product (IP) |
| Top-K Ground Truth | 100 |
| Vector dtype | float32 |
| Embedding Model | [Qwen3-VL-Embedding-8B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B) |
| Source Data | [pixparse/cc12m-wds](https://huggingface.co/datasets/pixparse/cc12m-wds) (Conceptual Captions 12M) |
| Base/Query Overlap | Yes (0.1%, query shards 40-41 included in base) |
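The figures in the table imply a sizeable raw footprint. The following is a small arithmetic sketch, not part of the original card; it counts only the raw float32 payload and ignores Parquet compression and Python object overhead. The helper name `raw_size_bytes` is made up for illustration.

```python
# Raw size of the embedding matrices, using the dimension and dtype from the table.
DIM = 4096
BYTES_PER_FLOAT32 = 4


def raw_size_bytes(num_vectors: int, dim: int = DIM) -> int:
    """Bytes needed to hold num_vectors float32 vectors of length dim."""
    return num_vectors * dim * BYTES_PER_FLOAT32


base_bytes = raw_size_bytes(10_000_000)   # train split (base vectors)
query_bytes = raw_size_bytes(10_000)      # test split (query vectors)

print(f"base:  {base_bytes / 1e9:.2f} GB ({base_bytes / 2**30:.1f} GiB)")
print(f"query: {query_bytes / 1e6:.2f} MB")
```

The base vectors come to about 163.84 GB (152.6 GiB) and the queries to about 163.84 MB, which is why loading the full `train` split into memory needs a large machine.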
## Splits
| Split | Rows | Description |
|-------|------|-------------|
| `train` | 10,000,000 | Base vectors (20 parquet shards, 500K rows each) |
| `test` | 10,000 | Query vectors |
| `neighbors` | 10,000 | Ground truth: top-100 nearest neighbor IDs per query |
## Schema
**train / test:**
| Column | Type | Description |
|--------|------|-------------|
| `id` | int64 | Sequential identifier (0-indexed) |
| `emb` | list\<float32\> | 4096-dim L2-normalized embedding |
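Because the embeddings are described as L2-normalized, inner product and cosine similarity should give the same scores. The sketch below shows one way to sanity-check that on a batch. It runs on synthetic vectors rather than real dataset rows, and both the helper `check_unit_norm` and the tolerance `atol=1e-3` are assumptions made for illustration.

```python
import numpy as np


def check_unit_norm(emb: np.ndarray, atol: float = 1e-3) -> bool:
    """Return True if every row of emb has an L2 norm within atol of 1."""
    norms = np.linalg.norm(emb, axis=1)
    return bool(np.all(np.abs(norms - 1.0) <= atol))


# Synthetic stand-in for a batch of embeddings (not real dataset rows).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4096)).astype(np.float32)
x /= np.linalg.norm(x, axis=1, keepdims=True)  # L2-normalize each row

ip = x @ x.T                                   # pairwise inner products
norms = np.linalg.norm(x, axis=1)
cos = ip / (norms[:, None] * norms[None, :])   # pairwise cosine similarities

print(check_unit_norm(x), np.allclose(ip, cos, atol=1e-5))
```

To apply the check to the dataset, pass a NumPy array built from a batch of `emb` values in place of `x`.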
**neighbors:**
| Column | Type | Description |
|--------|------|-------------|
| `id` | int64 | Query ID (matches test split) |
| `neighbors` | list\<int64\> | Top-100 base vector IDs by IP score (descending) |
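A common use of the `neighbors` split is to score an approximate index by recall@k. The function below is a minimal sketch of that metric under one common definition (set overlap between the retrieved top-k and the true top-k, averaged over queries). The name `recall_at_k` and the toy IDs are illustrative and not part of the dataset.

```python
import numpy as np


def recall_at_k(retrieved: np.ndarray, ground_truth: np.ndarray, k: int) -> float:
    """Mean fraction of the true top-k IDs that appear in the retrieved top-k.

    retrieved:    (num_queries, >=k) IDs returned by the index under test
    ground_truth: (num_queries, >=k) IDs, e.g. built from the `neighbors` column
    """
    hits = 0
    for got, want in zip(retrieved[:, :k], ground_truth[:, :k]):
        hits += len(set(got.tolist()) & set(want.tolist()))
    return hits / (len(ground_truth) * k)


# Toy example with 2 queries and k=3 (made-up IDs).
gt = np.array([[1, 2, 3], [4, 5, 6]])
res = np.array([[1, 2, 9], [6, 5, 4]])
print(recall_at_k(res, gt, k=3))
```

In the toy example five of the six true neighbors are recovered, so the result is 5/6. Order within the top-k does not affect this definition.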
## Usage
```python
from datasets import load_dataset
import numpy as np

ds = load_dataset("WenxingZhu/multimodal-embedding-10M")

# Query embeddings: (10_000, 4096), about 164 MB as float32
query_emb = np.array(ds["test"]["emb"], dtype=np.float32)

# Base embeddings: (10_000_000, 4096), read from 20 parquet shards.
# As float32 this is about 164 GB, so materialize it only on a machine
# with enough RAM; otherwise use streaming.
base_emb = np.array(ds["train"]["emb"], dtype=np.float32)

# Ground truth neighbors
neighbors = ds["neighbors"]["neighbors"]  # 10K lists, each holding 100 int64 IDs
```
### Streaming (recommended for 10M)
```python
from datasets import load_dataset
ds = load_dataset("WenxingZhu/multimodal-embedding-10M", split="train", streaming=True)
for row in ds:
    emb = row["emb"]  # list of 4096 floats
    break
```
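Row-by-row iteration is usually too fine-grained for building an index, so the streaming loop above is often wrapped in a batching helper. The sketch below groups rows into NumPy batches. `iter_batches` is a hypothetical helper, not part of the `datasets` library, and it is shown on stand-in rows so it runs without downloading anything.

```python
from typing import Dict, Iterable, Iterator, Tuple

import numpy as np


def iter_batches(
    rows: Iterable[Dict], batch_size: int
) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """Group rows with `id` and `emb` keys into (ids, embeddings) NumPy batches."""
    ids, embs = [], []
    for row in rows:
        ids.append(row["id"])
        embs.append(row["emb"])
        if len(ids) == batch_size:
            yield np.asarray(ids, dtype=np.int64), np.asarray(embs, dtype=np.float32)
            ids, embs = [], []
    if ids:  # final partial batch
        yield np.asarray(ids, dtype=np.int64), np.asarray(embs, dtype=np.float32)


# Stand-in rows; with the real dataset, pass the streaming `ds` object instead.
fake_rows = ({"id": i, "emb": [0.0] * 4096} for i in range(5))
shapes = [emb.shape for _, emb in iter_batches(fake_rows, batch_size=2)]
print(shapes)
```

With the real dataset, a batch size in the tens of thousands keeps each batch to a few hundred megabytes (10,000 rows of 4,096 float32 values is about 164 MB).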
## Generation Details
- **Model**: Qwen3-VL-Embedding-8B via vLLM (pooling runner, bfloat16)
- **Input**: Image-text pairs from cc12m-wds WebDataset shards, streamed from HuggingFace
- **Normalization**: L2 normalized post-embedding
- **Hardware**: 6x NVIDIA A100-80GB (DGX)
- **Throughput**: ~25 samples/sec per GPU
- **Total Generation Time**: ~3 days for 10M embeddings
- **Ground Truth**: Brute-force inner product, chunked numpy computation
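The ground-truth step is described only as a chunked brute-force inner product in NumPy. The sketch below shows one way such a computation could look. It is not the author's script: the function name, the chunk size, and the merge strategy are all assumptions, and it is demonstrated on small random data.

```python
import numpy as np


def brute_force_topk_ip(
    queries: np.ndarray, base: np.ndarray, k: int, chunk_size: int = 100_000
) -> np.ndarray:
    """Exact top-k base IDs per query by inner product, scanning base in chunks."""
    nq = queries.shape[0]
    top_scores = np.empty((nq, 0), dtype=np.float32)
    top_ids = np.empty((nq, 0), dtype=np.int64)
    for start in range(0, base.shape[0], chunk_size):
        chunk = base[start:start + chunk_size]
        scores = queries @ chunk.T  # (nq, chunk_len)
        ids = np.broadcast_to(
            np.arange(start, start + chunk.shape[0], dtype=np.int64), scores.shape
        )
        # Merge this chunk's candidates with the running best and keep the top k.
        cand_scores = np.concatenate([top_scores, scores], axis=1)
        cand_ids = np.concatenate([top_ids, ids], axis=1)
        keep = min(k, cand_scores.shape[1])
        part = np.argpartition(-cand_scores, keep - 1, axis=1)[:, :keep]
        top_scores = np.take_along_axis(cand_scores, part, axis=1)
        top_ids = np.take_along_axis(cand_ids, part, axis=1)
    # Final sort so IDs are ordered by descending score, as in `neighbors`.
    order = np.argsort(-top_scores, axis=1)
    return np.take_along_axis(top_ids, order, axis=1)


# Small random demo (not dataset rows).
rng = np.random.default_rng(0)
demo_base = rng.standard_normal((1000, 32)).astype(np.float32)
demo_queries = rng.standard_normal((5, 32)).astype(np.float32)
demo_topk = brute_force_topk_ip(demo_queries, demo_base, k=10, chunk_size=128)
print(demo_topk.shape)
```

Chunking bounds the size of the score matrix: only a `(num_queries, chunk_size)` block is held at a time, instead of the full `(num_queries, 10_000_000)` matrix.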
## Related
- [multimodal-embedding-1M](https://huggingface.co/datasets/WenxingZhu/multimodal-embedding-1M): 1M version from cc3m-wds
## License
Apache 2.0
Provider: WenxingZhu



